DOCUMENT RESUME

ED 084 302 TM 003 317 



AUTHOR        Betz, Nancy E.; Weiss, David J.
TITLE         An Empirical Study of Computer-Administered Two-Stage
              Ability Testing.
INSTITUTION   Minnesota Univ., Minneapolis. Dept. of Psychology.
SPONS AGENCY  Office of Naval Research, Washington, D.C. Personnel
              and Training Research Programs Office.
REPORT NO     N00014-67-A-0113-0029; PMP-73-4
PUB DATE      Oct 73
NOTE          60p.



EDRS PRICE    MF-$0.65 HC-$3.29
DESCRIPTORS   *Individualized Programs; *Individual Tests; *Item
              Analysis; *Measurement Techniques; *Predictive
              Ability (Testing); Psychometrics; Scoring
IDENTIFIERS   Adaptive Testing; *Computer Assisted Testing;
              Flexilevel Tests; Linear Tests; Response-Contingent
              Testing; Two-Stage Testing



ABSTRACT 

A two-stage adaptive test and a conventional peaked
test were constructed and administered on a time-shared computer
system to students in undergraduate psychology courses. (The
two-stage adaptive test consisted of a routing test followed by one
of a series of measurement tests.) Comparison of the score
distributions showed that the two-stage test scores were more
variable than the linear test scores; the distribution of two-stage
scores was normal, whereas that of the linear test scores tended
toward flatness. When the effect of memory of the items was taken
into account, the two-stage test was found to have higher
test-retest stability than the conventional test. The relationship
between the two-stage and conventional test scores was relatively
high and primarily linear, but about 20% of the reliable variance in
the conventional test scores was left unaccounted for. Further
analyses of the two-stage test showed that the difficulty levels of
the measurement tests were not optimal, and that 4 to 5% of the
examinees were routed to inappropriate measurement tests. The poor
internal consistency of the measurement tests in comparison with
that of the routing and conventional tests was apparently due to the
extreme homogeneity of ability within the measurement test
sub-groups. The findings of the study were interpreted as favorable
to continued exploration of two-stage testing. (Author/NE)



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.

AN EMPIRICAL STUDY OF COMPUTER-ADMINISTERED

TWO-STAGE ABILITY TESTING



Nancy E. Betz 
and 

David J. Weiss 



Research Report 73-4

Psychometric Methods Program 
Department of Psychology 
University of Minnesota 

October 1973 

Prepared under contract No. N00014-67-A-0113-0029,
NR No. 150-343, with the Personnel and
Training Research Programs, Psychological Sciences Division 

Office of Naval Research 





Approved for public release; distribution unlimited.
Reproduction in whole or in part is permitted for 
any purpose of the United States Government. 






DOCUMENT CONTROL DATA - R & D

1.  ORIGINATING ACTIVITY: Department of Psychology,
    University of Minnesota
2a. REPORT SECURITY CLASSIFICATION: Unclassified
2b. GROUP:
3.  REPORT TITLE: An Empirical Study of Computer-Administered
    Two-Stage Ability Testing
4.  DESCRIPTIVE NOTES: Technical Report
5.  AUTHOR(S): Nancy E. Betz and David J. Weiss
6.  REPORT DATE: October 1973
7a. TOTAL NO. OF PAGES: 49
7b. NO. OF REFS: 35
8a. CONTRACT OR GRANT NO.: N00014-67-A-0113-0029
b.  PROJECT NO.: NR150-343
9a. ORIGINATOR'S REPORT NUMBER(S): Research Report 73-4,
    Psychometric Methods Program
9b. OTHER REPORT NO(S) (Any other numbers that may be assigned
    this report):
10. DISTRIBUTION STATEMENT: Approved for public release;
    distribution unlimited
11. SUPPLEMENTARY NOTES:
12. SPONSORING MILITARY ACTIVITY: Personnel & Training Research
    Programs, Office of Naval Research
13. ABSTRACT


A two-stage adaptive test and a conventional peaked test were
constructed and administered on a time-shared computer system to
students in undergraduate psychology courses. Comparison of the score
distributions yielded by the two tests showed that the two-stage test
scores were somewhat more variable than the linear test scores, and
that the distribution of two-stage scores was normal, whereas that of
the linear test scores tended toward flatness. The two-stage test had
higher test-retest stability than the conventional test when the
effect of memory of the items was taken into account. The relationship
between the two-stage and conventional test scores was relatively high
and primarily linear but left about 20% of the reliable variance in
the conventional test scores unaccounted for. Further analyses of the
two-stage test showed that the difficulty levels of the measurement
tests were not optimal, and that 4 to 5% of the testees were
misclassified into measurement tests. The relatively poor internal
consistency of the measurement tests in comparison to that of the
routing test and the conventional test was apparently due to the
extreme homogeneity of ability within the measurement test sub-groups.
The findings of the study were interpreted as favorable to continued
exploration of two-stage testing procedures. Suggestions for possible
ways to improve the characteristics of the two-stage testing strategy
are offered.

14. KEY WORDS

testing
ability testing
two-stage testing
computerized testing
adaptive testing
branched testing
individualized testing
tailored testing
programmed testing
response-contingent testing
automated testing



Contents

Introduction and review of literature
Method
    Design
    Test development
        Item pool
        Two-stage test
            Routing test
            Measurement tests
            Scoring
        Conventional linear test
    Administration and subjects
    Analysis of data
        Characteristics of score distributions
        Reliability
            Internal consistency
            Stability
        Additional analyses
Results
    Comparison of two-stage test and linear test on
    psychometric characteristics
        Variability
        Shape of the score distributions
        Reliability
            Internal consistency
            Stability
    Relationships between Linear and Two-stage Scores
    Comparison of Norming and Testing Item Statistics
        Item difficulties
        Item discriminations
    Additional Characteristics of the Two-stage Test
        Misclassification
Conclusions and Implications
References
Appendixes
    Appendix A
    Appendix B



AN EMPIRICAL STUDY OF COMPUTER-ADMINISTERED
TWO-STAGE ABILITY TESTING



The growth and refinement of time-shared computer
facilities have made it feasible to consider new approaches
to the measurement of abilities. One such approach involves
varying test item presentation procedures according to the
characteristics of the individual being tested; this approach
has been referred to as sequential testing (Cronbach and
Gleser, 1957; Evans, 1953; Krathwohl and Huyser, 1956;
Paterson, 1962), branched testing (Bayroff, 1964), programmed
testing (Cleary, Linn, and Rock, 1968a), individualized
measurement (Weiss, 1967), tailored testing (Lord, 1970),
response-contingent measurement (Wood, 1971, 1973), and, most
recently, adaptive testing (Weiss and Betz, 1973).

One model of adaptive testing is the two-stage procedure.
This testing strategy consists of a routing test followed by
one of a series of second-stage or "measurement" tests, each
of which consists of items concentrated at a different level
of difficulty. The purpose of the routing test is to give an
initial estimate of an individual's ability so that he may be
routed to the measurement test most appropriate to his
ability. Cronbach & Gleser (1957) appear to have been the
first to suggest the use of two-stage testing procedures.
Weiss (1973) describes several variations of the basic
two-stage strategy and compares them with other strategies of
adaptive ability testing.

The first reported study of the two-stage procedure
was an empirical study by Angoff and Huddleston (1958).
They compared two-stage procedures with conventional "broad
range" ability tests of verbal and mathematical abilities
from the College Entrance Examination Board's Scholastic
Aptitude Test. The two-stage test measuring verbal abilities
consisted of a 40-item routing test and two 36-item
measurement tests; their two-stage mathematical abilities
test consisted of a 30-item routing test and two 17-item
measurement tests. Nearly 6,000 students from 19 different
colleges were tested, and all testing was timed. In the
procedure followed, routing did not actually occur (i.e., the
routing test was not scored prior to the administration of
the measurement tests); rather, tests were administered in
sufficient combinations to allow a determination of the
effects of actual routing, had it occurred.

Results showed the measurement tests to be more reliable
in the groups for which they were intended than conventional
broad-range tests. Predictive validities of the measurement
tests, using grade point averages as the criterion, were
slightly higher than those of the conventional tests. Their
data also showed, however, that about 20% of the testees
would have been misclassified, or routed to an inappropriate
measurement test.



A series of studies of two-stage procedures was reported
by Cleary, Linn, and Rock (1968a, b; Linn, Rock, and Cleary,
1969). These were "real data" simulation studies, using
the responses of 4,885 students to the 190 verbal items of
the School and College Aptitude Tests and the Sequential
Tests of Educational Progress. The total group was randomly
split into a development group and a cross-validation group.
Four 20-item measurement tests were constructed by dividing
the total score distribution on the "parent" test into
quartiles and finding the 20 items which had the highest
within-quartile point-biserial correlations with the total
test score.

Cleary et al. studied four different procedures of
routing individuals to the measurement tests. The
"broad-range" routing procedure consisted of a 20-item routing
test with a rectangular distribution of item difficulties.
Based on their scores on these 20 items, individuals were
routed into one of the four measurement tests. The second
strategy was a double-routing or two-phase procedure. In
the first phase, scores on 10 items of median difficulty
(p=.5) were used to divide the group into halves. The
second phase used two additional 10-item routing tests;
scores on these sets of 10 items were used to divide each
first-phase subgroup into halves, yielding a total of four
groups. The third routing procedure, called the "group
discrimination" procedure, used the 20 items with the
largest between-quartile differences in item difficulties.

The fourth procedure, called "sequential" routing,
utilized the framework of the sequential sampling procedures
developed by the Statistical Research Group (1945)
and Wald (1950) and a specific procedure developed by
Armitage (1950). In this method items would be administered
to subjects one at a time. After scoring each item,
"likelihood ratios" were computed and a decision was made
either to assign the examinee to one of the four measurement
tests or to administer another item. If the examinee had not
been classified after all 23 routing items were administered,
he was assigned to the group yielding the largest likelihood
ratio. Cleary et al. also used a 3-group sequential
procedure with a maximum of 20 routing items.
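
The item-by-item decision rule described above can be illustrated with a minimal sketch of Wald's sequential probability ratio test. This is a simplified two-group version, not the four-group procedure of Cleary et al. or Armitage's specific method. The function name sprt_route, the assumed success probabilities p_low and p_high, and the error rates alpha and beta are all hypothetical choices made for illustration.

```python
import math

def sprt_route(responses, p_low=0.4, p_high=0.7,
               alpha=0.05, beta=0.05, max_items=23):
    """Classify an examinee as 'high' or 'low' from 0/1 item responses.

    Hypothetical two-group illustration of a sequential probability
    ratio test. p_low and p_high are assumed probabilities of a correct
    response in each group, not values taken from the studies cited.
    """
    upper = math.log((1 - beta) / alpha)   # decide 'high' at or above this
    lower = math.log(beta / (1 - alpha))   # decide 'low' at or below this
    log_lr = 0.0
    n_used = 0
    for n_used, correct in enumerate(responses[:max_items], start=1):
        # Update the log likelihood ratio after scoring each item.
        if correct:
            log_lr += math.log(p_high / p_low)
        else:
            log_lr += math.log((1 - p_high) / (1 - p_low))
        if log_lr >= upper:
            return "high", n_used
        if log_lr <= lower:
            return "low", n_used
    # No boundary crossed: assign to the group with the larger likelihood.
    return ("high" if log_lr > 0 else "low"), n_used
```

Under these assumed values, an examinee who answers consistently is classified after only a few items, whereas a mixed response pattern uses all 23 routing items and falls back on the larger likelihood.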

Scores on the two-stage tests were initially determined
by scaling the measurement tests using linear regression
weights to predict the total score on the parent test. A
later study (Linn et al., 1969) added the routing score
information to the scaled measurement test score.

Correlations between the two-stage test scores (based
on a maximum of 43 items) and scores on the 190-item parent
test were almost as high as the reliability estimates of
the parent test. Scores from the sequential routing
procedure correlated highest with total score, followed by
[...] and 42-item conventional tests, the group discrimination,
broad range, and double-routing procedures. Since the best
short conventional test was found to require about 35% more
items to achieve the same level of accuracy as the 3-group
sequential procedure, it was concluded that two-stage tests
can permit large reductions in the number of items administered
to an individual with little or no loss in accuracy.

Validity results, in terms of correlations with external
criteria of scores on the College Entrance Examination Board
Tests and the Preliminary Scholastic Aptitude Tests, were
even more favorable for the two-stage tests than were
correlations with total test score. The group discrimination
and 3-group sequential procedures yielded the highest
correlations with the criteria. With the exception of the
double-routing strategy, all of the two-stage procedures had
higher validities than conventional tests of equivalent
lengths. In most cases, the 40-item two-stage tests had higher
validities than 50-item conventional tests, and in five
comparisons they had higher validities than did the 190-item
parent test. Thus, it was demonstrated that two-stage tests
can achieve high predictive accuracy with substantially
fewer items than would be necessary in a conventional test,
although the data of Cleary et al., like those of Angoff and
Huddleston, showed a misclassification rate of about 20%.

Lord (1971d) presents results from theoretical studies
of two-stage testing procedures. All of his analyses were
based on the mathematics of item characteristic curve theory
and the following assumptions: 1) a fixed number of items
administered to each examinee, 2) dichotomous (right-wrong)
scoring, 3) normal ogive item characteristic curves, 4) a
unidimensional set of items, 5) all items of equal
discriminations, 6) peaked routing and measurement tests
(i.e., all items in each subtest were of the same difficulty),
and 7) linear (i.e., non-branched) routing and measurement
tests. Lord studied about 200 different strategies, varying
the total number of items (15 or 60), the number of
alternative measurement tests, the cutting points for
assignment to the second-stage tests, methods of scoring both
the routing test and the entire two-stage procedure, and
whether or not random guessing was assumed (for a 5-choice
item, within the 60-item tests only). Lord compared each
two-stage strategy with a peaked conventional test of
equivalent length in terms of information functions, which
indicate the relative numbers of items required to achieve
equivalent precision of measurement. Precision can be defined
as the capability of responses to a set of test items to
accurately represent the "true ability" of hypothetical
individuals.

Lord found that the linear test provided better
measurement around the mean ability level of the group, but
that the two-stage procedures provided increasingly better
measurement with increased divergence from the mean ability
level. The finding that the peaked linear test provided
better measurement around the mean ability level has been
supported by Lord's other theoretical studies comparing
peaked ability tests with tests "administered" under a
variety of adaptive testing strategies (Lord, 1970, 1971a,
1971c); thus, the peaked test always provided more precise
measurement than the adaptive test when ability was at the
point at which the test was peaked. However, as an
individual's ability deviated from the average, the peaked
test provided less precise measurement, and the adaptive test
provided more precise measurement.

The importance of these findings is that they indicate 
that the most precise or accurate measurement for any indi- 
vidual will be obtained by administering to him/her a test 
peaked at a difficulty level equal to that individual's 
ability level. Thus, test items should be of median, or 
p=.50, difficulty for each individual, rather than of median 
difficulty for a group of individuals varying in ability. 
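
The way precision falls off as ability departs from the difficulty at which a test is peaked can be sketched with the item information function of the normal ogive model. This is a minimal illustration assuming no guessing and equally discriminating items; the function names and the values used (40 items, a = 0.7, b = 0) are hypothetical and are not taken from Lord's analyses.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def item_information(theta, a, b):
    """Information of one normal ogive item (no guessing) at ability theta."""
    z = a * (theta - b)
    p = _STD_NORMAL.cdf(z)          # probability of a correct response
    density = _STD_NORMAL.pdf(z)    # slope of the curve is a * density
    return (a * density) ** 2 / (p * (1.0 - p))

def peaked_test_information(theta, n_items, a, b):
    """Information of a peaked test: n_items items sharing the same a and b."""
    return n_items * item_information(theta, a, b)

# Hypothetical 40-item test peaked at b = 0 with a = 0.7.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    info = peaked_test_information(theta, n_items=40, a=0.7, b=0.0)
    print(f"theta={theta:+.1f}  information={info:.2f}")
```

In this sketch the information is largest where theta equals b and declines symmetrically on either side, which is the pattern the preceding paragraphs describe for peaked tests.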

But ability level, and thus the appropriate level of
item difficulty for an individual, is not usually known in
advance; it is the test's function to measure it. The
two-stage strategy provides one method of adapting the
difficulty of the test to the individual's ability level, in
an effort to achieve more precise measurement. The routing
test gives an initial estimate of an individual's ability
level, and he/she is then routed or assigned to that
"measurement" test which is peaked at a difficulty level close
to his/her estimated ability.

Lord's theoretical study of two-stage testing procedures,
based on the notion that a short routing test can be used to
find the optimal peaked measurement test for any given
individual, as well as the studies of Angoff and Huddleston
(1958) and Cleary et al. (1968a, b; Linn et al., 1969) show
considerable potential for two-stage tests, in terms of
increases in internal consistency reliability, validity, and
precision of measurement. However, only Angoff and
Huddleston's was an empirical study, and even this study was
not able to account for the effects of actual routing. The
purpose of the present study, then, was to begin an empirical
evaluation of two-stage testing procedures; the study
involved the development, computer-controlled administration,
and comparison of a two-stage test and a peaked conventional
test.

METHOD 

Design 

This study was part of a larger program of research
involving a series of empirical comparisons of a number of
major strategies of adaptive testing. These studies were
directed at answering two major questions: 1) Does adaptive
testing show any advantages as compared to conventional
ability testing procedures? and 2) Are some strategies of
adaptive testing superior to others? To answer these
questions, the studies were designed to permit the
investigation of 1) the psychometric characteristics of tests
administered under each adaptive strategy, in comparison with
conventional linear tests, 2) the test-retest stability of
ability estimates derived from each strategy, 3) the
relationships between ability estimates derived from
different adaptive strategies, and 4) the relationship between
ability estimates derived from conventional testing and each
of the adaptive strategies.

The design involved the construction and
computer-controlled administration of tests using each
adaptive strategy and a conventional linear test. So that data
concerning the inter-relationships between strategies could
be obtained, the tests were administered in pairs such that
each combination of two tests would be administered to a
large group of subjects. To obtain test-retest stability
data, tests were re-administered to the same individuals
after an interval of about six weeks.

In the first phase of the research, a two-stage, a
flexilevel (Lord, 1971b), and a conventional linear test
were constructed. Each test consisted of 40 items drawn
from a common item pool but selected so that there would
be no overlapping of items between tests. The tests were
then administered two at a time to a total group of about
350 individuals such that each combination of two tests was
given to about 100 individuals.

To examine the possibility of fatigue or practice 
effects or an interaction between test sequence and test- 
ing strategy, the order of administration of the tests within 
each combination was randomized on the first testing so 
that each test would be administered first to approximately 
half the testees and administered second to the other half. 
Retests were administered in the same order as the subject 
had initially received them. 

Computer administration was necessary only for the
adaptive tests, but the conventional linear test was also
computer-administered to control for the possibility of
"novelty" effects resulting from an atypical mode of test
administration.

Although the first phase included the administration
of a flexilevel test, the results of its administration will
be reported in a later paper. The present paper is concerned
only with the evaluation and comparison of the
characteristics of the two-stage and the linear test and with
the relationship between ability estimates derived from the
two tests.

Of interest, first of all, were the characteristics of
the score distributions yielded by the tests. It was
expected that the two-stage test, because it adapts the
difficulties of the items to the ability level of the testee,
would utilize more of the available score distribution than
would the conventional test. On a conventional "peaked"
test, item difficulties are appropriate for individuals of
average ability but may be inappropriate for testees who
deviate from the average ability at which the test is peaked.
Scores of high ability individuals may be artificially
depressed if the items are too easy for them, and scores of
low ability subjects may be artificially inflated if they
correctly guess the answers to the large number of items
that will be too difficult for them. In the two-stage test,
however, high ability subjects would be routed to more
difficult measurement tests, thus giving more "top" to the
test, and low ability subjects would take measurement items
more appropriate to their ability level, thus reducing the
effects of random guessing. That the probability of random
guessing decreases as item difficulties get closer to the
subject's ability level has been suggested by Lord (1970),
Owen (1969), Urry (1970), and Wood (1971), among others. Thus,
because the two-stage test adapts item difficulties to the
testee's ability level, two-stage test scores should have
higher variability than scores from peaked conventional
tests. In addition, the score distributions were examined to
determine whether the tests yielded skewed, rectangular,
peaked, or non-unimodal distributions.



Another psychometric consideration was the internal
consistency reliability of the tests. The purpose of the
routing test is to assign each individual to that measurement
test composed of items most appropriate for him. Thus,
routing, if it is effective, should form subgroups of
individuals for whom the assigned measurement test is composed
of items of appropriate difficulty. For 5-alternative
multiple-choice items, appropriate difficulty corresponds
to a p-value of approximately .60 (Cronbach & Warrington,
1952; Guilford, 1954; Lord, 1952); items at that difficulty
level maximize internal consistency reliability. Thus,
maintaining item difficulty near this level for all or most
individuals in the group should lead to increased
reliability of the measurement tests in comparison to that of
the routing test or the linear test, in which items are of
median difficulty only for some individuals in the group.
Angoff and Huddleston (1958) found this to be the case; their
"narrow range" (measurement) tests were more reliable for
the groups for which they were intended than were the
conventional "broad-range" tests. However, the routing process
should also create subgroups of individuals more homogeneous
in ability. Because lower ability variance will decrease
internal consistency reliability estimates, the effects of
more appropriate item difficulties may be counteracted.
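
The effect of subgroup homogeneity on a reliability estimate can be sketched with the classical relation between reliability and score variance. This is a minimal illustration that assumes the error variance is the same in the full group and in the routed subgroup; the function name and the numerical values are hypothetical and are not results from this study.

```python
def reliability_in_subgroup(reliability_full, sd_full, sd_sub):
    """Expected reliability in a subgroup whose score SD is sd_sub.

    Assumes the error variance is equal in the full group and the
    subgroup, so that only the true-score variance shrinks.
    """
    error_variance = (1.0 - reliability_full) * sd_full ** 2
    return 1.0 - error_variance / sd_sub ** 2

# Hypothetical values: reliability .90 in a group with SD 10,
# then a routed subgroup whose SD is only 5.
print(round(reliability_in_subgroup(0.90, 10.0, 5.0), 2))
```

With these illustrative numbers, halving the standard deviation lowers the expected reliability from .90 to about .60, even though the items themselves are unchanged.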

Thus, in comparing the internal consistency reliability
of the measurement tests to that of the linear and routing
tests, it was important, first, to evaluate the extent to
which routing led to more optimal measurement test item
difficulties; this was done by determining whether item
difficulties in the measurement tests changed in the
direction of p=.60 from their values as determined from the
norming studies. Second, the extent of sub-group homogeneity
was evaluated by examining the score variability within each
measurement test.

Lord's (1971d) theoretical demonstration that the
precision of measurement of two-stage tests was nearly
constant over the whole ability range implies fewer random
factors in the ordering of individuals in two-stage tests
than in conventional tests. In conventional tests, which
are most precise at average ability levels, scores of
individuals near the extremes of ability will be highly
affected by random errors, and the ordering of such
individuals will be determined in large part by random
factors. Because of the more nearly constant precision of
two-stage tests, the scores for individuals at all levels of
the ability distribution are more likely to be based largely
on underlying ability rather than on random factors; two-stage
tests should thus yield higher test-retest stability
coefficients than conventional tests. One complicating factor,
however, involves differential memory effects. A subject
re-tested on the conventional test will repeat the same set of
items. A subject retested on the two-stage test will take the
same set of 10 routing items but may take an entirely
different set of 30 measurement test items if he is routed
differently the second time. In comparing the stability, then,
of two-stage and conventional tests, it was necessary to
account for the differential effects of memory.

Some studies of two-stage testing procedures (e.g.,
Cleary et al., 1968a, b; Linn et al., 1969) have evaluated
their results in terms of the accuracy with which two-stage
test scores estimated scores on a conventional test. The
focus of adaptive testing, however, should be on improving
the measurement characteristics of scores derived from
adaptive tests rather than on estimating conventional test
scores. If it is true that two-stage tests yield more
precise measurement at the extremes of the distribution than
do conventional tests, the ordering of individuals in the
tails of the two score distributions should be different.
Thus a relatively low correlation with scores derived from
a linear test would provide evidence that the two-stage
test was ordering individuals differently but would not
indicate which ordering had the higher relationship to the
trait being measured. Direct evidence pertaining to the
latter issue must, of course, come from the examination of
each test's relationship to independent ability criteria.
Indirect evidence may eventually be derived from determining
whether the intercorrelations of a number of adaptive
tests, all of which would be constructed to achieve more
nearly constant precision throughout the ability range,
were uniformly higher than the correlation of each with a
conventional test. Analyses pertaining more directly to
this issue will be reported in later studies in this series.

Test Development 

Item Pool 

The item pool used to construct the adaptive and
conventional tests of verbal ability consisted of
5-alternative multiple-choice vocabulary items. The items were
normed on a large group of college students, and item
statistics of difficulty (proportion correct) and
discrimination (biserial correlation with total score) were
obtained. Using a biserial correlation of at least .30 as a
selection criterion, 369 items were available for use in
constructing the tests to be administered in the first study.
Table 1 describes the available item pool as a
cross-classification of levels of item difficulty and biserial
correlation coefficient and shows the number of items
available in each cell of the cross-tabulation. It may be
noted that the pool consisted of considerably more very easy
than very difficult items, and that the more highly
discriminating items occurred at the easier levels of
difficulty.

Two-stage Test 

The two-stage test was composed of a 10-item routing
test and four 30-item measurement tests. Testees were
assigned to one of the four measurement tests on the basis
of their scores on the routing test.
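
The assignment step can be sketched as a simple lookup from routing score to measurement test. This is a minimal illustration only: the cutoff scores below are hypothetical placeholders, since the actual cutting points are not given in this section, and the function name route is likewise an assumption.

```python
N_ROUTING_ITEMS = 10
N_MEASUREMENT_ITEMS = 30

def route(routing_score, cutoffs=(3, 5, 7)):
    """Return the measurement test number, 1 (easiest) to 4 (hardest).

    cutoffs holds hypothetical inclusive upper score bounds for
    measurement tests 1, 2, and 3; higher scores go to test 4.
    """
    if not 0 <= routing_score <= N_ROUTING_ITEMS:
        raise ValueError("routing score must be between 0 and 10")
    for test_number, upper_bound in enumerate(cutoffs, start=1):
        if routing_score <= upper_bound:
            return test_number
    return len(cutoffs) + 1

# Every examinee answers 10 routing items plus one 30-item measurement test.
items_per_examinee = N_ROUTING_ITEMS + N_MEASUREMENT_ITEMS
print(route(4), items_per_examinee)
```

Whatever cutoffs are used, each examinee takes 40 items in total, which matches the length of the other tests in the design.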

Items for each subtest were selected to approximate
the characteristics of the theoretical items used by Lord
(1971d) in his study of two-stage testing procedures. In
describing the characteristics of the theoretical items,
Lord used parameters based on assumptions of the normal
ogive model in item characteristic curve theory (Lord and
Novick, 1968). The characteristics of the real item pool
used in this study were specified in terms of the
traditional item parameters of classical test theory (i.e.,
proportion correct as an index of item difficulty, and
item-total score correlation as an index of item
discriminating power). The normal ogive item parameter values
suggested by Lord were used to select the levels of item
difficulty and discrimination of the measurement tests. The
routing test item difficulties and discriminations were
selected by other criteria. Following the selection of the
routing and measurement test items, their difficulty and
discrimination values were converted to the normal ogive
parameters for use in the scoring equation.

Using Lord's notation, normal ogive parameter "a"
represents item discriminating power and is related to
the biserial correlation between item response and latent
ability. Since latent ability estimates were not available
for item norming, normal ogive item parameter estimates used
in this study were computed using total norming test score
as an estimate of latent ability. Although Lord assumed
equally discriminating items in his theoretical two-stage
tests, he admits it is rarely possible to construct real
tests with equally discriminating items. In this study,
items were selected whose discriminations clustered as
closely as possible around the desired values.

[Table 1. Number of items available in the pool, cross-classified
by item difficulty and biserial correlation; cell values are not
legible in this copy.]

Item parameter "b" represents item difficulty and is
essentially a normal distribution transformation of 1-p,
although its exact value is dependent on the value of "a".
This conversion makes item difficulty more easily interpretable,
since positive values correspond to more difficult
items and negative values to less difficult items. Lord's
two-stage procedures used peaked routing and measurement
tests, i.e., all routing items, and all items composing a
particular measurement test, had a constant "b" value.
Using real items, it was not possible to construct per- 
fectly peaked subtests; rather, desired values of "b" were 
selected for the measurement tests, and the items were 
selected to distribute closely around the desired values. 

Routing test. The 10 routing items were selected to
have a mean item-total score biserial correlation of approx- 
imately .57. This value was selected to be somewhat higher 
than that chosen for the measurement tests in order to im- 
prove the assignment of testees to measurement tests. 

The difficulty level of the routing items was selected 
to fall at the median ability level of the group taking 
into account the probability of chance success on an item, 
as a result of random guessing (Lord's parameter "c"). Lord 
(1953, 1970) found that optimal measurement could be achieved 
at a difficulty level somewhat easier than the value of 
(1+c)/2. Since the items used in this study had 5 alternative
responses, "c" was equal to .2, and (1+c)/2 was equal
to .60. The mean difficulty level of the routing items was
set at .62, slightly easier than p=.60. Thus, ten items
with p-values distributed closely around .62 and biserial 
coefficients as close as possible to .57 were selected for 
the routing test out of the 369 items available. 
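The target difficulty above follows directly from the guessing parameter. A minimal Python sketch, assuming that "c" is simply the reciprocal of the number of response alternatives (the function name is illustrative and does not come from the report):

```python
def chance_corrected_midpoint(n_alternatives: int) -> float:
    """Return (1 + c) / 2, where c = 1 / n_alternatives is the chance-success level."""
    c = 1.0 / n_alternatives
    return (1.0 + c) / 2.0


# Five-alternative items give c = .2, so the midpoint is .60;
# the routing items were then set slightly easier, at p = .62.
midpoint = chance_corrected_midpoint(5)
```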

The first row of Table 2 summarizes the characteristics 
of the routing items. The mean, standard deviation, minimum, 
and maximum values of the traditional item parameters are 
presented. The mean "a" and "b" values were calculated for 
use in the scoring equation and are presented after their 
corresponding traditional item parameter values. It may be 
noted that the mean biserial correlation (.57) is very close 
to that desired, but the standard deviation (.07) and range 
of these values (.43 to .71) show that the items were not
equi-discriminating. Similarly, the mean item difficulty
fell at the desired point (p=.62), but the 10 items, varying
from p=.57 to p=.68, did not form a perfectly peaked test.
Item difficulties were normally distributed, with a slight
tendency toward flatness rather than peakedness. Appendix
Table A-1 shows the characteristics (p-value and biserial
coefficient) of each of the 10 routing items.

[Table 2. Summary of item difficulties and biserial correlations (mean, standard deviation, minimum, maximum) and mean normal ogive "a" and "b" values for the routing, measurement, and linear tests. Table body not recoverable.]

To make assignments to measurement tests, score ranges
on the routing test of 0 through 3, 4 and 5, 6 and 7, and
8 through 10 were used respectively to assign testees to
each of four measurement tests. The lowest score range was
the widest since it was expected to include many "chance"
scores.
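The routing rule can be sketched as a small lookup. This is only an illustration: the function name and the difficulty labels are hypothetical, and it assumes (as the adaptive design implies, though the text says only "respectively") that lower routing scores lead to easier measurement tests.

```python
def assign_measurement_test(routing_score: int) -> str:
    """Map a routing-test score (0-10) to a measurement test, labeled by difficulty.

    Assumption: low routing scores are routed to easier measurement tests.
    """
    if not 0 <= routing_score <= 10:
        raise ValueError("routing score must be between 0 and 10")
    if routing_score <= 3:   # widest band, expected to absorb chance-level scores
        return "easiest"
    if routing_score <= 5:
        return "second easiest"
    if routing_score <= 7:
        return "second most difficult"
    return "most difficult"  # scores of 8 through 10
```

For example, a testee with 6 correct on the routing test would be sent to the second most difficult measurement test under this assumption.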

Measurement tests. In selecting the measurement test
items, a mean item biserial coefficient of .45 was desired.
This value corresponds to an "a" of approximately .50,
which is the value of item discriminatory power used by Lord
in his theoretical studies of adaptive testing (Lord, 1970,
1971a,c,d).

In choosing the difficulty levels of the measurement
tests, Lord calculated a value equal to a(b2 - b), where
b2 is the difficulty of a particular measurement test and b
is the routing test difficulty. These values were distributed
relatively symmetrically around zero and ranged from -1.5 to
+1.5 when six measurement tests were available. Because
four measurement tests were used in this study, values of
+1.0, +.40, -.40, and -1.0 were selected for a(b2 - b). The
corresponding mean item difficulties of the four measurement
tests were p=.26, p=.46, p=.73, and p=.88. Thus, in
constructing the most difficult measurement test, the 30
items having "p" values closest to .26 and biserial coefficients
distributed around .45 were selected; a similar
procedure was followed in constructing the other three
measurement tests.
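The correspondence between the traditional item statistics and the quantity a(b2 - b) can be checked numerically. The sketch below assumes the standard normal ogive relations a = r / sqrt(1 - r^2) and b = z(1 - p) / r, where r is the biserial correlation and z is the inverse normal transformation; the report does not print these conversion formulas, so they are an assumption here, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def ogive_a(biserial: float) -> float:
    """Discrimination 'a' from a biserial correlation (assumed relation)."""
    return biserial / sqrt(1.0 - biserial ** 2)


def ogive_b(p: float, biserial: float) -> float:
    """Difficulty 'b' as the normal deviate of 1 - p divided by the biserial (assumed relation)."""
    return NormalDist().inv_cdf(1.0 - p) / biserial


a_measurement = ogive_a(0.45)        # desired measurement-test biserial
b_routing = ogive_b(0.62, 0.57)      # routing test: p = .62, biserial = .57

# a(b2 - b) for the four target measurement-test difficulties
offsets = [a_measurement * (ogive_b(p, 0.45) - b_routing)
           for p in (0.26, 0.46, 0.73, 0.88)]
```

Under these assumed relations, a biserial of .45 gives an "a" of about .50, and the four offsets fall close to the +1.0, +.40, -.40, and -1.0 values named in the text.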

The resulting characteristics of the four measurement
tests are summarized in Table 2. It may be noted that the
mean item difficulties of tests 1 and 4 were slightly
different from the desired values; this was due to the
necessity of taking item discrimination as well as item
difficulty into account. However, the resulting values of
a(b2 - b), which were +1.09, +.39, -.40, and -1.13, were
good approximations to the values specified beforehand. As
with the routing test, item difficulties of each of the
measurement tests were normally distributed around the mean
value. Also, the mean biserial correlations for the two
most difficult measurement tests were lower than those for
the two easier tests. This was due to the relative scarcity
of difficult items having high biserial coefficients, as was
indicated in Table 1. And while the mean biserial levels
were relatively close to the .45 value desired, the standard
deviation and range of these values show that it was not
possible to construct equi-discriminating tests using
the available item pool within the limitations of the
research design (i.e., the construction of several
non-overlapping tests). Appendix Tables A-2 through A-5 give
the characteristics of each of the 30 items in each measurement
test in terms of p-values and biserial coefficients.

Thus, the two-stage test consisted of a normally dis- 
tributed routing test whose mean difficulty fell at approxi- 
mately the median ability level of the group (under the 
assumptions of random guessing), from which testees were 
routed or assigned to one of four normally distributed 
measurement tests whose means were located at points on
the ability continuum distributed around the median ability 
level of the total group. 

Scoring. The method used to score the two-stage test
was derived from Lord's (1971d) theoretical work. It consisted
of obtaining the maximum likelihood estimates of
ability from the routing test ($\hat{\theta}_1$, where $\theta$ indicates position
on the latent ability continuum) and the measurement test
($\hat{\theta}_2$). After these two estimates were obtained, they were
weighted and then averaged to obtain a composite ability
estimate, $\hat{\theta}$. In this study, the estimates of $\theta$ derived
from the routing and measurement tests were determined by
the following formula:

$$\hat{\theta} = \frac{1}{a}\,\Phi^{-1}\!\left[\frac{x/m - c}{1 - c}\right] + b$$

In this formula, $\Phi^{-1}$ is the inverse of the normal ogive,
a represents the mean discrimination value
of the subtest items, x is the number correct, m is the
total number of items administered in that subtest (either
10 or 30), c is the chance-score level (always .2), and b
represents the mean difficulty of the items in that subtest.
Whenever x=m (perfect score) or x=cm (chance score), $\hat{\theta}$
cannot be determined. Therefore, when x was equal to m,
it was replaced by x=m-.5, and when x was less than or equal
to cm, it was replaced by x=cm+.5.
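The subtest scoring rule, including the adjustments for perfect and chance-level scores, can be sketched as follows. This assumes that the estimate has the normal ogive form theta = b + (1/a) * z[(x/m - c) / (1 - c)], with z the inverse normal transformation; that form is consistent with the symbol definitions above but is an assumption in this sketch, and the function name is illustrative.

```python
from statistics import NormalDist


def subtest_theta(x: float, m: int, a: float, b: float, c: float = 0.2) -> float:
    """Ability estimate for a peaked subtest of m items with x correct.

    Assumed form: theta = b + (1/a) * inverse_normal((x/m - c) / (1 - c)).
    """
    if x >= m:            # perfect score: estimate undefined, use m - .5
        x = m - 0.5
    elif x <= c * m:      # chance score or below: use cm + .5
        x = c * m + 0.5
    corrected = (x / m - c) / (1.0 - c)   # proportion correct adjusted for guessing
    return b + NormalDist().inv_cdf(corrected) / a
```

With c = .2, a score of 6 out of 10 gives a guessing-adjusted proportion of .5, so the estimate equals the subtest's mean difficulty b.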

Lord (1971d) admits that there is no uniquely good way
to weight the subtests. He computed variance weights,
but a preliminary examination of the results of applying his
weighting formula to the two-stage data from this study
showed some non-monotonicity in the relationship between
the number right obtained on the measurement test and the
total test $\hat{\theta}$ for people who obtained the same routing
score. Therefore, rather than using the variance weights,
each subtest $\hat{\theta}$ was weighted according to the number of
items on which it was based; the resulting total score
estimates were then strictly monotonically related to the
actual number correct on the measurement test, given the
same routing score. The ability estimate used in this
study, then, was defined by the following equation:

$$\hat{\theta} = \frac{10\,\hat{\theta}_1 + 30\,\hat{\theta}_2}{40}$$

Scores determined in this way have values similar to standard
or "z" scores (Lord & Novick, 1968), i.e., most will fall
between ±3, and the meaning of a $\hat{\theta}$ of +1 corresponds to that
of a standard or "z" score of +1.
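The item-count weighting described above amounts to a weighted mean of the two subtest estimates. A minimal sketch, with hypothetical function and parameter names:

```python
def composite_theta(theta_routing: float, theta_measurement: float,
                    m_routing: int = 10, m_measurement: int = 30) -> float:
    """Weight each subtest ability estimate by the number of items it is based on."""
    total_items = m_routing + m_measurement
    return (m_routing * theta_routing + m_measurement * theta_measurement) / total_items
```

Because the measurement test contributes 30 of the 40 items, it carries three times the weight of the routing test in the composite.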



In the following sections, references to "two-stage"
scores will always refer to $\hat{\theta}$; scores reported for the
routing and measurement tests, on the other hand, will
always refer to the number correct on the particular subtest
in question.

Conventional linear test. Lord (1971d) compared his
60-item two-stage tests with a 60-item peaked linear test
having equi-discriminating items (biserial correlations
with the underlying trait of about .45). The linear test
used for comparative purposes in this study had 40 items
so that its length would equal that of the two-stage test.
Items were selected from the pool shown in Table 1 that had
difficulties closest to p=.55 and item-total score biserial
correlation coefficients closest to .45. The mean, standard
deviation, minimum value, and maximum value of the linear
test item difficulties and biserial coefficients are shown
in Table 2. Again, the mean values of the normal ogive
parameters are presented for comparative purposes. As
was true for the routing and measurement tests, the linear
test was neither equi-discriminating nor perfectly peaked.
The linear test did have a smaller range of item biserial
values (.32 to .54) than did the two-stage subtests, and
the range of item difficulties (.41 to .66), while large
for a peaked test, was small in relation to the range
covered by all of the four measurement tests. The distribution
of linear test item difficulties, like that of
the two-stage subtests, was normal. Appendix Table B-1
presents the p-value and biserial coefficients for each of
the 40 items in the linear test.






An individual's score on the linear test was simply
the number of correct responses given to the 40 items;
thus scores could potentially vary from 0 to 40.

Administration and Subjects 

The tests were administered to undergraduate students 
taking the introductory psychology and basic psychological 
statistics courses at the University of Minnesota. The
students were tested at individual cathode-ray terminals
(CRTs) connected by acoustical couplers to a time-shared 
computer. The CRTs were located in quiet rooms, and there 
was a maximum of 3 students in each room at one time. An 
administrator was present at all times to help students with 
the terminal equipment and to ensure that no consultation 
took place among testees. A set of instructional screens 
preceded the beginning of testing on all of the initial 
tests, and the students were given the opportunity to re- 
view the instructional screens before taking the retest. 
Few students had difficulty operating the terminals after 
completing the instructions; CRT test administration thus
seems quite appropriate for college students. 

On the first testing, 214 students completed the two-stage
test and of these 112 also took the linear test (the
remainder completed a flexilevel test). The students were
retested after a mean interval of 39 days (about 5½ weeks),
with a standard deviation of 11 days and a range from 14 to
62 days. Of the 214 students who completed a two-stage test
on first testing, 178 were retested, and of these 85 also
completed the linear test a second time (the remainder
completed another adaptive test on retest).

Analysis of Data 

The data to be analyzed consisted of 2 two-stage test
scores, one from the initial test (time 1) and one from the
retest (time 2), for each individual. For about half of the
group there were also 2 scores (test and retest) from the
linear test. The time 1 data was divided into 2 groups,
one consisting of those subjects who had taken the two-stage
test first and the linear test second (order 1) and the other
consisting of those subjects for whom the order was reversed
(order 2). To analyze the effect of order of administration,
mean scores from order 1 and order 2 for the two-stage test
and the linear test were compared using a t-test of the
significance of mean differences. Table 3 presents the score
means and standard deviations derived from order 1 and order
2 and the value of t and its associated probability for each
comparison. Since there were no significant differences
between means for either the two-stage or linear tests,
order of administration was concluded to be an unimportant
variable, and all subsequent analyses were done with data
from the two order groups combined.

[Table 3. Score means and standard deviations for order 1 and order 2 on the two-stage and linear tests, with t-tests of the significance of mean differences. Table body not recoverable.]

Characteristics of Score Distributions 

Analyses of the characteristics of the score distributions
were done separately for initial test data and for
retest data. The score means and standard deviations were
calculated for each distribution, but because the scores
were expressed in different terms (i.e., number correct
for the linear test versus position on a latent ability
continuum for the two-stage), the scores and their means
and standard deviations were not directly comparable.¹
Thus, in order to compare the variability of the score
distributions, an index of relative variability was computed.
This index indicates the extent to which the potential score
range is effectively utilized and was computed by dividing
the standard deviation of each score distribution by its
total potential score range. The score range for the linear
test was 40, and that for the two-stage test was 6 (±3
standard deviations on the latent ability continuum).
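The index of relative variability is a simple ratio, sketched below. The function name and the example standard deviations are hypothetical and are not values from the study.

```python
def relative_variability(standard_deviation: float, potential_range: float) -> float:
    """Proportion of the potential score range used: SD divided by the range."""
    if potential_range <= 0:
        raise ValueError("potential range must be positive")
    return standard_deviation / potential_range


# Hypothetical standard deviations, for illustration only
linear_index = relative_variability(8.0, 40.0)     # number-correct scale, range 40
two_stage_index = relative_variability(1.5, 6.0)   # latent ability scale, range 6
```

Dividing by the potential range puts the number-correct scale and the latent ability scale on a common footing, which is what allows the two tests to be compared.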

To determine the nature of the score distributions,
measures of skewness and kurtosis were obtained and tested
for significant departures from normality (McNemar, 1969,
pp. 25-28 and 87-88).
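One common way to carry out such a check is sketched below. It uses moment-based skewness and excess kurtosis with the large-sample standard errors sqrt(6/N) and sqrt(24/N); these are textbook approximations and are assumed here, since the exact formulas applied in the study are not reproduced in the text. The function name is illustrative.

```python
from math import sqrt


def skewness_kurtosis(scores):
    """Return (g1, g2, z_skew, z_kurt) for a list of scores.

    g1 is moment skewness, g2 is excess kurtosis (negative means flatter than normal).
    The z values use large-sample standard errors, an assumption of this sketch.
    """
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    g1 = m3 / m2 ** 1.5
    g2 = m4 / m2 ** 2 - 3.0
    z_skew = g1 / sqrt(6.0 / n)
    z_kurt = g2 / sqrt(24.0 / n)
    return g1, g2, z_skew, z_kurt
```

A distribution is treated as departing from normality when either z value exceeds the critical value for the chosen significance level.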

Reliability 

Internal consistency. Internal consistency reliability
for the linear test and for each subtest (i.e., routing test
and the four measurement tests) of the two-stage test was
estimated by the Hoyt (1941) method. However, since the
reliabilities of the linear test, the routing test, and the
measurement tests were based on different numbers of items,
they were not directly comparable. Thus, the Spearman-Brown
prophecy formula was used to project the reliabilities
of the two-stage subtests to what they would be had they
been based on 40 items (the length of the linear test)
rather than 10 items (routing) or 30 items (measurement).
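The Spearman-Brown projection can be sketched as follows. The formula is the standard one, r' = k r / (1 + (k - 1) r), with k the ratio of the new length to the original length; the function name and the example reliability are hypothetical.

```python
def spearman_brown(reliability: float, n_items: int, target_items: int = 40) -> float:
    """Project a reliability coefficient to a test of target_items items."""
    k = target_items / n_items   # lengthening factor
    return k * reliability / (1.0 + (k - 1.0) * reliability)


# Hypothetical 10-item subtest with reliability .50, projected to 40 items
projected = spearman_brown(0.50, n_items=10)
```

In the hypothetical example, k = 4 and the projected reliability is 2.0 / 2.5 = .80.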

To determine whether or not the measurement test item
difficulties were appropriate for maximizing internal consistency,
the mean difficulty of the items in each measurement
test for that group of subjects who had taken it was
calculated. For further comparisons of the item statistics
derived from the norming and the actual test administration,
the means and standard deviations of the discriminations
(biserial correlation with total score) of the measurement
test items were calculated. The item difficulty and discrimination
statistics were also calculated for the linear
and routing tests. The total score used in these calculations
was the number correct score on the linear test, and
the number correct on the two-stage subtest rather than $\hat{\theta}$.
The item statistics for the linear and routing tests were
based, of course, on the total group of testees, whereas
those for the measurement tests were based only on that
more homogeneous group of testees who had completed each
measurement test.

¹ The linear test scores could also have been expressed in
terms of $\theta$, or position on the latent ability continuum.
However, since most conventional tests are scored using
"number correct", that scoring method was used in this study
to maintain practical relevance of the results.

To determine the extent to which the routing process
had led to a restriction of range, or greater homogeneity
of ability, within each measurement test subgroup, the
means and standard deviations of the number correct scores
on each measurement test, and also on the linear and routing
tests, were calculated. To facilitate comparison of
the standard deviations, which were based on tests of 10,
30, or 40 items, each standard deviation was divided by its
total potential score range (the number of items in the
test) to obtain the index of the extent to which the potential
score range was used.

Stability. A series of analyses of test-retest stability
were done. First, Pearson product-moment correlation
coefficients were calculated for the test-retest score
distributions of each test. Eta coefficients and the significance
of curvilinear relationships between the test and
retest scores were also calculated. Second, to examine the
effect of interval length on test-retest stability, the
total group was divided into three subgroups according to
the length of interval between test and retest. The three
groups were short interval (14-30 days), moderate interval
(31-46 days), and long interval (47-62 days); product-moment
correlation coefficients were then calculated for the
test-retest scores of the individuals in each subgroup.

Third, in order to analyze the effect of memory of the
items on test-retest stability, two-stage stability coefficients
were calculated using only those individuals who were
routed into the same measurement test on both testings.
These individuals thus took the same 40 items on test and
retest, therefore making the effects of memory comparable
to that of the linear test, on which all subjects repeated
the same items.






Additional Analyses

To analyze the relationship between the two-stage and
linear test scores, product-moment correlations and eta
coefficients for each total score distribution regressed
on the other one were computed. Tests of curvilinearity
were made to determine if there were non-linear relationships
between the two score distributions.
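A test of curvilinearity of this kind compares the squared correlation ratio (eta squared) with the squared product-moment correlation. The sketch below uses the usual F ratio with k - 2 and N - k degrees of freedom, where k is the number of distinct values (or class intervals) of the predictor; this is a standard textbook procedure assumed for illustration, not a formula quoted from the report, and the function name is hypothetical.

```python
from collections import defaultdict


def linearity_check(x, y):
    """Return (r_squared, eta_squared, F) for the regression of y on x.

    Requires more than two distinct x values, more cases than x values,
    and eta_squared below 1.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r_squared = sxy ** 2 / (sxx * syy)

    groups = defaultdict(list)           # y scores grouped by x value
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss_between = sum(len(g) * (sum(g) / len(g) - mean_y) ** 2 for g in groups.values())
    eta_squared = ss_between / syy

    k = len(groups)
    f_ratio = ((eta_squared - r_squared) / (k - 2)) / ((1.0 - eta_squared) / (n - k))
    return r_squared, eta_squared, f_ratio
```

When the group means fall exactly on a straight line, eta squared equals r squared and the F ratio is zero, indicating no curvilinearity.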

Other analyses concerned certain characteristics of 
the two-stage test itself. First, the distribution of 
routing test scores and the number and percentage of indi- 
viduals assigned to each measurement test were examined in 
order to evaluate the appropriateness of the difficulty 
level of the routing test and the score intervals selected 
for assigning testees to measurement tests. Second, the 
number and percentage of misclassifications into measurement
tests was determined; the criteria selected to identify
misclassified individuals were 1) perfect scores (all 30
items correct), indicating that the measurement test was
too easy, and 2) chance scores (6 or less correct responses),
indicating that the test was too difficult.
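The two misclassification criteria can be expressed as a small classifier. The function name and return labels are hypothetical; the thresholds are those stated in the text.

```python
def classify_measurement_score(n_correct: int, n_items: int = 30,
                               chance_score: int = 6) -> str:
    """Flag a measurement-test score as a misclassification, if it is one."""
    if not 0 <= n_correct <= n_items:
        raise ValueError("number correct must be between 0 and n_items")
    if n_correct == n_items:
        return "too easy"        # perfect score
    if n_correct <= chance_score:
        return "too difficult"   # chance-level score
    return "appropriate"
```

Counting the "too easy" and "too difficult" results across testees gives the number and percentage of misclassifications.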

RESULTS 

Comparison of Two-stage and Linear Tests on Psychometric
Characteristics

Variability. Table 4 presents the means, standard
deviations, and the "proportion of range utilized" index
of variability for the two-stage and linear test scores.
The data in Table 4 show that the two-stage scores utilized
a slightly larger proportion of their potential range than
did the linear test scores, on both the original testing and
the retest. Further, although the mean scores on both tests
increased on the retest, the standard deviations and the
proportion of range utilized were the same on original testing
and on retest for both the two-stage and linear test
scores, thus suggesting consistency in the extent to which
scores derived from each test utilized the available score
range.

Shape of the score distributions. Table 5 presents
data describing the two-stage and linear score distributions.
The two-stage distributions, for both test and
retest, satisfied the criteria of normality, since neither
the indices of skewness nor kurtosis were significantly
different from zero. However, there was some tendency
toward positive skew and flatness in both distributions of
two-stage scores. The linear test scores, on the other
hand, showed some tendency, although not statistically
significant, toward negative skew and showed a marked
tendency toward flatness on the initial test. The latter
result was statistically significant at the .02 level.

[Table 4. Means, standard deviations, and proportion of range utilized for the two-stage and linear test scores on test and retest. Table body not recoverable.]

[Table 5. Skewness and kurtosis of the two-stage and linear score distributions on test and retest. Table body not recoverable.]

Reliability

Internal consistency. Table 6 presents the Hoyt internal
consistency reliability coefficients for the linear
test and each two-stage subtest, and the estimated reliability
of each subtest had its length been 40 items. It
is evident that the linear test and the "40-item" routing
test were highly reliable and more reliable than any of
the measurement tests. The two intermediate difficulty
measurement tests (tests 2 and 3) had especially low
reliability coefficients. These findings are contrary to
those of Angoff and Huddleston (1958), who found that the
measurement ("narrow-range") tests were more reliable than
the conventional ("broad-range") test. The results are also
contrary to the expectation that higher reliabilities would
result from more appropriate item difficulties, i.e., item
difficulties close to .60, the median difficulty with chance
taken into account, in each measurement test.

Table 7 shows the mean item difficulties for each two-stage
subtest and the linear test. The means for the linear
test, both time 1 and time 2 (.60 and .64) were very close
to .60, and those for the routing test (.68 and .71), although
somewhat easier, were still relatively close to .60.
On the other hand, with the exception of test 3, the measurement
tests were not maximally appropriate for the groups
taking them, since their mean item difficulties were not
close to p=.60. Measurement test 4 was obviously too easy
for those routed to it (p=.78 and .81) while measurement
test 1 (p=.43 and .44) was too difficult.

However, in addition to the fact that three of the four
measurement tests were not of optimal difficulty, there
was evidence for a restriction of range or decreased group
heterogeneity and, thus, depressed internal consistency
reliability coefficients. Table 8 shows the means and
standard deviations of the number correct scores for the
two-stage subtests and the linear test and the standard
deviations as proportions of the number of items (potential
range) in each test. As is shown, the proportion of
potential range used by the 10-item routing test (.23 on
both test and retest) was somewhat greater than that used
by the 40-item linear test (.21 both times). But the
measurement tests, which had 30 items, used considerably
less of the potential range than did either the routing
test or the linear test. Measurement test 3 used only
half as much of its potential score variability as did
the linear and routing tests. Referring back to Table 6,
it is interesting to note that the reliability coefficients
are very closely related to the proportions of potential
range used by each of the tests. For example, measurement
test 3 was both the least variable and the least reliable.
In general, the rank order of the tests or subtests in
terms of internal consistency reliability corresponds to
their rank order in terms of score variability. Thus, it
would seem that the increased homogeneity of the groups of
subjects taking each measurement test, as evidenced by the
low score variability, was an important factor in the
unreliability of the measurement tests.

[Table 6. Hoyt internal consistency reliability coefficients for the linear test and each two-stage subtest, with Spearman-Brown estimates for a length of 40 items. Table body not recoverable.]

Table 7

Mean and standard deviation of item difficulties (proportion
correct) obtained from administration of the two-stage and
linear tests

                         Time 1                    Time 2
                  No.                       No.
Test             items   Mean   S.D.       items   Mean   S.D.
Routing            10    .68    .12          10    .71    .09
Measurement
  1                30    .43    .16          30    .44    .15
  2                30    .51    .11          30    .4?    .12
  3                30    .64    .15          30    .69    .11
  4                30    .78    .13          30    .81    .13
Linear             40    .60    .11          40    .64    .12

[Table 8. Means and standard deviations of number correct scores for the two-stage subtests and the linear test, with standard deviations expressed as proportions of the number of items. Table body not recoverable.]

The low score variability of the measurement tests
in comparison to that of the linear test is in contrast
with the comparatively high variability of the total scores
on the two-stage test as was shown in Table 4. However,
given the fact that the testees were all college
undergraduates, a group that can be assumed to have an already
restricted range of ability from that in the general
population, it is not surprising that dividing this total group
into four subgroups even more homogeneous in ability led
to reduced score variability. It is likely that the measurement
tests would show higher reliability if the two-stage
test were administered to a group more representative of
the general population in terms of a greater range of ability
levels.

Stability. Table 9 gives the test-retest stability
correlations for the two-stage and linear tests. The first
three sets of columns show the stability correlations as
a function of the length of the interval between test and
retest; the last two columns show the stability of each
test as computed on the total group of subjects.

The length of the interval between test and retest 
did not have consistent effects on stability. The linear 
test was most stab]e in the interval of medium length 
(r=.wL) and least stable in the longest interval (r=.87), 
whereas the two-stage test was most stable in the shortest 
interva]. (.92) and least stable in the medium-length 
interval (.85). It is interesting, though possibly not 
significant, to note that the two-stage test was more 
stable over the longest interval tli.m the linear test. 
This may have some implications for tho relative importance 






of memory effects in the stability of the two tests, i.e.,
if memory of the items is important in the stability of a
test, the longer the interval, the less effect memory will
have and, thus, the lower will be the stability coefficient.

The linear test (r=.89) had a slightly higher total
group stability than the two-stage test (r=.88), but the
difference was not significant and could easily have been
in the opposite direction. Tests for curvilinearity, using
the product-moment correlations and eta coefficients, showed
that the relationship between the test and retest scores was
primarily linear, with no significant curvilinearity.

In addition to the effect of interval length on the
obtained test-retest stability coefficient, the other
factor considered was the effect of memory of the items
on the retest on the size of the stability coefficient.
The stability of the linear test, which was r=.89, was
based on the correlation between the test and retest scores
of subjects who had repeated the same 40 items. The
stability of the two-stage scores was, therefore, calculated
only for the 97 subjects who were assigned to the same
measurement test on both test and retest, thus also
repeating the same 40 items. That test-retest stability
correlation was .93, higher than both the linear and the
total group two-stage stability coefficients. Thus it
would appear that the stability of the linear test was
based to a larger extent on memory of the items than was
that of the two-stage test, suggesting that the latter
yields ability estimates which more consistently reproduce
the testee's ability over the time interval between
testings.

Relationships between Linear and Two-stage Scores 

Table 10 presents the linear (product-moment) and eta
coefficients describing the relationships between the
two-stage and linear score distributions on test and retest.
All of the linear and eta coefficients were significant at
p < .001. The only significant degree of curvilinearity
was found in the regression of the linear scores on the
two-stage scores for the initial test, although there was
a tendency toward curvilinearity (p=.12) in the regression
of two-stage on linear scores on the retest. Examination
of the bivariate scatter plots showed that the curvilinearity
was due to a restriction of range in the lower end of the
linear score distribution in comparison to the greater
utilization of the two-stage score range at the lower ends.
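
The curvilinearity tests described above compare the squared eta coefficient with the squared product-moment correlation. The sketch below shows one conventional form of that comparison, the textbook F ratio for departure from linearity. It is an assumption that this matches the authors' exact procedure, and all function names are illustrative rather than taken from the report.

```python
from collections import defaultdict
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def eta_squared(x, y):
    """Correlation ratio of y on x: between-group SS over total SS,
    where y scores are grouped by each distinct x score.
    Returns (eta squared, number of groups)."""
    groups = defaultdict(list)
    for a, b in zip(x, y):
        groups[a].append(b)
    grand = mean(y)
    ss_total = sum((b - grand) ** 2 for b in y)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total, len(groups)


def curvilinearity_f(x, y):
    """F ratio for departure from linearity in the regression of y on x.
    Returns (F, (df numerator, df denominator)).
    Assumes at least three distinct x scores and some within-group variance."""
    r2 = pearson_r(x, y) ** 2
    eta2, k = eta_squared(x, y)
    n = len(x)
    f = ((eta2 - r2) / (k - 2)) / ((1 - eta2) / (n - k))
    return f, (k - 2, n - k)
```

Because eta depends on which variable is treated as the predictor, the test is run once for each direction of regression, which is why Table 10 reports two eta coefficients and two p-values per testing.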

The linear relationship between the two-stage and
linear test scores was relatively high on both test and






Table 10

Regression analysis of relationships between
two-stage scores and linear scores, and tests for
curvilinearity
(N=110 Time 1, N=85 Time 2)

                                               Time 1   Time 2

Product-moment correlation                       .84      .80

Eta coefficients
  Regression of two-stage scores
    on linear scores (eta)                       .85      .84
    Significance of curvilinearity (p-value)     .74      .12

  Regression of linear scores
    on two-stage scores (eta)                    .88      .82
    Significance of curvilinearity (p-value)     .04      .90






retest (.84 and .80). However, these values also indicate
that the proportions of variance accounted for (r²) were
only .70 and .64, respectively. The proportions of reliable
variance in the linear test, as given by the Hoyt internal
consistency reliability coefficients, were .89 and .90;
thus, the correlation between the two-stage and linear test
scores failed to account for 19% of the reliable variance
in the linear test on initial testing, and 26% on retest.
It would appear, therefore, that the linear test and the
two-stage test are not interchangeable approaches to
measuring the same ability.
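
The arithmetic behind the 19% and 26% figures can be checked with a short sketch. It takes the reported proportions of variance accounted for (.70 and .64) and the Hoyt reliabilities (.89 and .90) as inputs; the function name is illustrative. Squaring the rounded correlation .84 gives .7056, so the reported .70 presumably reflects an unrounded correlation, which is an inference and not something the text states.

```python
def unaccounted_reliable_variance(r_squared, reliability):
    """Reliable variance in one test that the other test fails to explain:
    the reliability (proportion of reliable variance) minus r squared."""
    return reliability - r_squared


# Values as reported in the text: (r squared, Hoyt reliability of the linear test)
reported = {"test": (0.70, 0.89), "retest": (0.64, 0.90)}

for occasion, (r2, rel) in reported.items():
    gap = unaccounted_reliable_variance(r2, rel)
    print(f"{occasion}: {gap:.2f} of the linear test's variance is reliable but unexplained")
```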

Comparison of Norming and Testing Item Statistics 

Since this study is the first to report on non-simulated
two-stage test administration, it is appropriate to
examine the effect of actual two-stage testing on item
characteristics. Relevant data from both the two-stage
and linear test have been presented earlier in Table 7;
additional data are in Tables 11 and 2.

Item difficulties. Table 7 gives the means and standard
deviations of item difficulties as obtained from actual
administration of the two-stage and linear tests. These
values may be contrasted with the values as obtained from
the norming studies, which were presented in Table 2.

It may be noted, first of all, that the linear and
routing tests, both of which were taken by the total group
of subjects, were somewhat easier for the tested group
(on first testing) than they had been for the norming sample.
On the linear test, average difficulty for the norming group
(Table 2) was p=.56, while for the tested group (Table 7)
it was p=.60 (time 1). On the routing test the respective
average difficulties were p=.62 for the norming group and
p=.68 for the tested group. Since both of these differences
were statistically significant (p < .05), it is possible
that the tested group was slightly superior in verbal ability,
although both samples were taken from the same population.

However, of more importance in this study was the effect
that changes in group composition toward greater homogeneity
in ability level, caused by the routing process, would have
on the item difficulties of the measurement tests. On all
four measurement tests, the testing mean item difficulties
changed in the direction of p=.60 from their norming values.
The two more difficult measurement tests (1 and 2), with
norming means of .24 and .46, were significantly easier
(p < .001 and p < .01) and closer to median difficulty for
the groups of testees routed into them (p=.43 and .51






Table 11

Mean and standard deviation of item discrimination
values (biserial correlation with total number correct)
from administration of the two-stage and linear tests

                             Biserial coefficient
                            Time 1          Time 2
Test           No. items   Mean   S.D.     Mean   S.D.

Routing           10       .67    .10      .69    .11
Measurement
  1               30       .49    .14      .46    .16
  2               30       .39    .19      .37    .18
  3               30       .31    .19      .37    .25
  4               30       .60    .32      --     .42
Linear            40       .56    .15      .58    .16






respectively). Similarly, the two less difficult tests (3
and 4), with norming values of .73 and .89, were
significantly more difficult (p < .05 and p < .001) and closer
to median difficulty for the subjects taking them (p=.64
and .78 respectively). These findings suggest that each
measurement test was more appropriate to the ability level
of that subgroup taking it than it would be for the total
group of subjects.

Tables 2 and 7 also show that the testing values of the
standard deviations of the item difficulties were uniformly
larger than the norming values. This finding implies that
groups of items which show very similar characteristics
when normed on one group of subjects may show more
divergent characteristics when administered to groups differing
from the norming sample in composition and range of ability
levels.

Item discriminations. Table 11 presents the means and
standard deviations of item discrimination values (biserial
correlation with number correct) as obtained from the
administration of the tests. A comparison of these values
with the norming values as presented in Table 2 shows that
the testing mean item discrimination values for the linear
and routing test were higher than the corresponding norming
values; the mean biserials of the linear test items were
.47 from the norming studies but .56 and .58 from the test
and retest, and the routing test increased from a mean
discrimination of .57 in norming to .67 and .69. In contrast,
the only measurement test to show higher item discrimination
values on both test and retest was test 1, the most
difficult test, whose means were .42 in norming but .49 and .46
on test and retest. The items in tests 2 and 3 were less
discriminating in testing than they had been in norming,
and those in test 4 were more discriminating on the first
test but less discriminating on the retest. Further, the
standard deviations of the item discrimination values were
again larger in testing than they had been in norming. The
items in test 4 especially showed much greater variability
in their discriminating power.

The substantial changes that were found in both the
level and variability of item discriminating power were
probably a factor in the rather poor internal consistency
reliability of the measurement tests and suggest that item
statistics derived from norming samples composed of one
range of ability levels may be inappropriate when applied
to a group composed of a different range of ability levels.



Additional Characteristics of the Two-stage Test

The results thus far have suggested certain problems 
with the two-stage test. Three of the four measurement 
tests were not of optimal difficulty for the groups of 
subjects taking them, and the item discrimination values 
of the measurement tests tended to be both lower and more 
variable in actual two-stage testing than they had been in 
norming. Thus, the two-stage test was further examined to 
evaluate the degree to which it met its major objective. 
That is, the two-stage test was analyzed to determine 
whether the "routing" test assigned members of a group of 
individuals varying rather widely in ability to longer 
"measurement" tests such that each measurement test was 
essentially "peaked" at the mean ability of a far more 
homogeneous group of subjects and was thus more appropriate 
to their level of ability than would be a test designed to 
measure the full range of ability within the larger group. 

In first examining the characteristics of the 10-item
routing test, it was found that the mean number correct
was 6.78 on the first test and 7.18 on the retest (see Table
8). These high mean scores were close to expectation
because the test was constructed to be somewhat easier than
the median ability with chance success accounted for (p=.60).
However, on both test and retest, the distribution of
routing test scores showed a significant degree of negative skew
indicating a predominance of high scores (7 to 10 correct).

The high and significantly skewed routing scores,
coupled with the score intervals selected for assignment
to measurement tests (0-3, 4-5, 6-7, and 8-10), meant that
a majority of the testees were assigned to the two most
difficult measurement tests (tests 1 and 2). Table 12
summarizes data on the number and percentage of the total
group assigned to each measurement test and the mean and
standard deviation of the number correct scores obtained
by each of these subgroups.
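
The routing rule can be sketched in a few lines of Python. The score intervals are those given in the text; the pairing of each interval with a measurement test number (lowest interval to test 4, the easiest, and highest interval to test 1, the most difficult) is inferred from the surrounding discussion and not stated explicitly. The function name and the sample scores are hypothetical.

```python
from collections import Counter

# (lowest routing score, highest routing score, measurement test)
# Test 1 is the most difficult and test 4 the easiest (inferred pairing).
ROUTING_INTERVALS = [
    (0, 3, 4),
    (4, 5, 3),
    (6, 7, 2),
    (8, 10, 1),
]


def route(routing_score):
    """Return the measurement test for a number-correct score on the 10-item routing test."""
    for low, high, test in ROUTING_INTERVALS:
        if low <= routing_score <= high:
            return test
    raise ValueError(f"routing score out of range: {routing_score}")


# Hypothetical routing scores, skewed toward the high end as in the study.
sample_scores = [9, 8, 7, 3, 10, 6, 5, 8]
assignments = Counter(route(s) for s in sample_scores)
```

With a negatively skewed routing distribution, fixed intervals like these send most examinees to tests 1 and 2, which is the imbalance the text goes on to describe.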

The data in Table 12 show several deficiencies of the 
two-stage test used in this study. First, the imbalance 
in the numbers of testees taking the individual measurement 
tests is obvious and consistent; roughly half of the total 
group took the most difficult test on both test and retest, 
whereas only about one-tenth of the group took the easiest
test. Although the percentages taking each test at time 1 and
time 2 are fairly comparable, there was a tendency for the 
imbalance to be even more pronounced on the retest. 



[Table 12: number and percentage of the total group assigned
to each measurement test, and mean and standard deviation of
number-correct scores for each subgroup, Time 1 and Time 2;
tabled values not recoverable]

Second, as was pointed out in the section on reliability,
the tests were not of optimal difficulty for those
groups of individuals taking them. The most appropriate
mean item difficulty would be around p=.60, meaning that
the desired mean number correct on each measurement test
would be about 18. As Table 12 shows, however, the two
most difficult tests were too difficult (mean total scores
of 12.98 and 15.38 respectively) for the average subject
taking them, and the two easier tests were too easy (means
of 19.13 and 23.39 respectively). These results and the
findings of the rather low number-correct score variability
of the measurement tests, as shown in Table 8 and discussed
in the reliability section, suggest that the total group
was more homogeneous in ability than expected. If the
cutting scores for assignment to measurement tests had been
set higher, e.g., 0-4, 5-6, 7-8, and 9-10, the two most
difficult measurement tests would probably have been more
appropriate, but the placement of higher ability subjects
into the easier tests would have made these two tests even
easier, and thus more inappropriate for many of the
individuals assigned to them, than they were using the score
intervals selected for this two-stage test.

Misclassification. A different approach to the evaluation
of the appropriateness of assignment to measurement
tests was to identify the extent to which particular
individuals were classified into inappropriate tests. Defining
misclassified individuals as those who obtained perfect
scores (i.e., all 30 items correct), indicating that the
test was too easy, or scores at or below chance (i.e.,
scores of 6 or less correct), indicating that the test was
too difficult, there were 9 misclassifications (about 4%) on
the first test and 9 (5.0%) on the retest. All 18
misclassifications were the result of scores at or below chance
on the most difficult measurement test, thus providing
additional evidence that this test was too difficult for many
individuals routed to it. However, the 4 to 5%
misclassification rate obtained here was a considerable improvement
over the 20% rates obtained in the studies of Angoff and
Huddleston (1958) and Cleary et al. (1968a, b), although
this may be due in part to different criteria of
misclassification. Thus, although the measurement tests were not
optimal for the groups taking them, few individuals took
a test which was highly inappropriate.
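
The misclassification criterion above is simple enough to state as code. The sketch below uses the thresholds given in the text (a perfect score of 30, or 6 or fewer correct on a 30-item measurement test); the function names and the example scores are hypothetical.

```python
def is_misclassified(number_correct, n_items=30, chance_score=6):
    """True if a measurement-test score suggests the test was inappropriate:
    a perfect score (too easy) or a score at or below chance (too difficult)."""
    return number_correct == n_items or number_correct <= chance_score


def misclassification_rate(scores, n_items=30, chance_score=6):
    """Proportion of examinees whose measurement-test score meets the criterion."""
    flagged = sum(is_misclassified(s, n_items, chance_score) for s in scores)
    return flagged / len(scores)


# Hypothetical number-correct scores for ten examinees.
example_scores = [5, 12, 18, 30, 22, 7, 15, 19, 25, 14]
example_rate = misclassification_rate(example_scores)  # 2 of the 10 are flagged
```

As the text notes, rates computed this way are only comparable across studies that use the same criterion.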

CONCLUSIONS AND IMPLICATIONS

Considering that the two-stage test used in this study 
had some deficiencies, the findings of the study were generally
favorable to the continued exploration of two-stage testing 






procedures. The two-stage test, scored using a variation
of the method used in Lord's (1971d) theoretical study,
yielded scores which were normally distributed and utilized a
consistently higher proportion of the available score
range than did the linear test. In other empirical studies
of adaptive testing where the distribution of scores has
been examined, a tendency toward badly skewed scores with
definite bunching at the high end of the distribution has
been found (Bayroff & Seeley, 1967; Bayroff, Thomas, &
Anderson, 1960; Seeley, Morton, & Anderson, 1962). Thus,
the two-stage test constructed for this study yielded a
better distribution of scores than has been found in most
empirical studies of adaptive testing to date. The
significantly flat distribution of linear test scores may
have been a function of deviations from peakedness in its
construction; a more peaked test might have yielded a more
normal distribution of scores.

The findings regarding the reliability of the two-stage
test were less clear. In terms of test-retest
stability, the two-stage test scores were quite reliable
(r=.88) over a mean interval of 5.5 weeks, essentially as
stable as the linear test scores (r=.89). However, when
the effect of memory of the items was equated for the two
testing strategies, the two-stage scores were the more
stable (r=.93). Thus, the two-stage test yielded 7.3%
more stable variance than did the linear test of the same
number of items and with the same potential for memory
effects.
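
The 7.3% figure follows from comparing squared stability coefficients. A minimal check, assuming "stable variance" means the squared test-retest correlation (the function name is illustrative):

```python
def stable_variance_gain(r_a, r_b):
    """Difference in proportion of stable variance (squared test-retest
    correlation) between test A and test B."""
    return r_a ** 2 - r_b ** 2


# Two-stage (same measurement test on both occasions) vs. linear test
gain = stable_variance_gain(0.93, 0.89)  # 0.8649 - 0.7921
print(f"{gain:.3f}")
```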

The relatively poor internal consistency reliability
of the measurement tests, as compared to the high
reliabilities of the routing test and the conventional linear
test, was a finding in contrast to those of Angoff and
Huddleston (1958) and was probably due to a combination
of factors. First, the routing process created subgroups
of individuals who were very homogeneous in ability. This
was not an unexpected finding, especially given the
relative homogeneity of ability in a college student
population in comparison to that in a more general population.
Further, even though increasing subgroup homogeneity
decreases internal consistency, the purpose of the two-stage
test is to do precisely that; by initially classifying
a group of subjects as to ability, as the routing test
does, it is possible to measure them using the most
appropriate peaked measurement test. The best two-stage testing
procedure would be one containing an infinite number of
measurement tests, such that there would be a peaked test
perfectly suited to each individual's ability. In this
hypothetical mode of testing, there would be complete






homogeneity of ability within subgroups since each measure- 
ment test would be taken only by individuals with exactly 
equal ability. Thus, it is perhaps unrealistic to expect 
high internal consistency reliability from tests which 
function in this way. 

In addition to the extreme subgroup homogeneity, the 
item difficulties of the measurement tests were not optimal 
for high reliability, and many of the items which had been 
highly discriminating in the norming studies were much less 
discriminating when administered to more homogeneous samples 
from the total group, thus reducing the internal consistency. 
Both of these inadequacies can be traced to the
inappropriateness of traditional methods of determining item
parameters for items to be used in adaptive testing. Only after
administering a two-stage test to a defined group of indi- 
viduals is it possible to determine how difficult and how 
discriminating the items will be for each subgroup of indi- 
viduals formed; thus, selecting items for two-stage tests 
using traditional item parameters can at present be only 
an approximate procedure. Perhaps the construction of 
future two-stage tests should use item parameters derived 
from heterogeneous samples for selection of the routing 
test items but item parameters derived from more homogeneous 
subgroups of the total norming sample for the selection of 
items for each of the measurement tests. Alternatively, item 
parameters estimated using the techniques of modern test 
theory (e.g., Lord & Novick, 1968) might be appropriate if
it can be shown that these parameters are independent of the 
range and level of ability in the groups on which they are 
determined. 

The selection of score intervals for assignment to 
measurement tests is also a matter that needs further study. 
In this study, the score intervals selected were somewhat 
inappropriate, leading to an uneven distribution of testees 
among measurement tests. Although the measurement tests 
were more appropriate in difficulty for the groups taking 
them than a test peaked at the median total-group diffi- 
culty would be, they were still either somewhat too easy 
or somewhat too difficult for the groups taking them.
However, few individuals were misclassified under the criteria
used; the 5% rate of misclassification was a large
improvement over the 20% rates of Angoff and Huddleston's (1958)
and Cleary et al.'s (1968a, b; Linn et al., 1969) two-stage
tests.

The relationship between the linear and two-stage test 
scores was relatively high (.84 and .80) and primarily
linear. The nonlinearity that was found in the regression 






of the linear scores on the two-stage scores on the first 
test seemed to be due to restriction in the lower score
ranges of the linear test in comparison to the lack of 
range restriction in the two-stage scores. However, further 
analyses showed that the relationship between the two tests 
left about 20% of the reliable variance in the linear test 
scores and an unknown amount of reliable variance in the 
two-stage test scores unaccounted for. 

A conventional linear test, however, should not be 
taken as a standard against which new methods of testing 
must be evaluated. Although a peaked conventional test 
provides probably the most accurate measurement for indi- 
viduals whose ability level is near the group mean or the 
difficulty level at which the test is peaked, its accuracy 
becomes increasingly less as an individual's ability level 
deviates from the mean (Lord, 1970, 1971a, c,d). Adaptive 
tests, on the other hand, provide almost constant accuracy 
throughout the range of ability (Lord, 1970, 1971a, c,d). 
Thus, the relationship between the two-stage and linear 
tests can become meaningful only in the comparative con- 
text of indices of relationship between other adaptive 
strategies and the two-stage test, and indices of the 
extent to which the two-stage test and the linear test are 
found to predict a variety of relevant external criteria. 
Previous studies of two-stage and other adaptive testing 
strategies have found the adaptive tests to have higher 
relationships with external criteria than conventional 
tests of equivalent length (Angoff and Huddleston, 1958; 
Linn et al., 1969; Waters, 1964, 1970; Waters & Bayroff,
1971; see Weiss & Betz, 1973). No studies to date have 
examined the relationships between two or more adaptive 
tests. Thus, the validation of two-stage testing procedures
depends on additional research in this area.

For further study of two-stage testing procedures, 
it should be possible to use the information gained in 
this study to select more optimal score intervals for 
assignment to measurement tests, to select more appro- 
priate measurement test item difficulties, and to improve 
the internal consistency reliability by selecting items 
shown to be highly discriminating for particular subgroups 
as well as for the total group. A method of selecting the 
routing test score intervals that would probably be superior 
to rational or trial-and-error selection would be to
compute each individual's latent ability estimate from the
routing test (θ̂, as described in the scoring section) and
to assign him or her to that measurement test whose mean
difficulty in normal ogive parameter terms ("b" values) is
closest to the estimate of his or her ability derived from
the routing test.
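
The proposed assignment rule amounts to a nearest-value lookup. The sketch below illustrates it under stated assumptions: the mean "b" values for the four measurement tests are invented for illustration and do not come from the report, and the ability estimate is taken as already computed on the same scale.

```python
# Hypothetical mean normal-ogive difficulty ("b") for each measurement test.
# Test 1 is the most difficult and test 4 the easiest; the numbers are illustrative only.
MEAN_B_BY_TEST = {1: 1.5, 2: 0.5, 3: -0.5, 4: -1.5}


def assign_measurement_test(theta_hat, mean_b_by_test=MEAN_B_BY_TEST):
    """Return the test whose mean difficulty is closest to the ability estimate.
    On an exact tie, the test listed first in the mapping is returned."""
    return min(mean_b_by_test, key=lambda test: abs(mean_b_by_test[test] - theta_hat))
```

Unlike fixed number-correct intervals, this rule places the cutting points implicitly, midway between adjacent tests' mean difficulties on the ability scale.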






However, the most obvious deficiency of two-stage 
testing procedures in general is that individuals may be 
routed to highly inappropriate measurement tests. A low
ability individual may guess enough routing items correctly 
to place him in a measurement test that is too difficult. 
A higher ability individual confronted with a set of routing 
items that he is unable to answer correctly as a result of 
specific gaps in his knowledge or anxiety at the early 
stages of testing would be routed to a measurement test 
that is too easy. 

One approach to this problem, of course, would be to
lengthen the routing test. This approach, however, would
undermine one advantage of two-stage testing, i.e., to
arrive at an initial estimate of each individual's ability
as quickly and efficiently as possible so that a larger set
of items relevant to his/her ability may be administered.
A more desirable approach would seem to be to include a 
recovery routine in the computer program controlling test 
administration. This routine would detect individuals 
who had apparently been misclassified after only a few
measurement test items had been administered; for example, 
a chance score or a near-perfect score after 10 measurement 
test items had been administered would cause the individual 
to be re-routed into the next easier or next more difficult 
measurement test. The process could be repeated if follow- 
ing re-routing the individual was still wrongly classified. 
This procedure would mean that individuals would complete 
different total numbers of items depending on the ease or 
difficulty of correctly classifying them; thus, the number 
as well as the difficulty level of the items administered 
would be adapted to each individual. 
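
The recovery routine described above can be sketched as follows. This is a hypothetical illustration, not the program used in the study: the check after 10 items and the 30-item test length come from the text, while the specific thresholds (2 or fewer correct as a chance score, 9 or more as near-perfect), the cap on the number of re-routings, and the `answer_item` callback are assumptions introduced here.

```python
def reroute(test, n_correct, chance=2, near_perfect=9, n_tests=4):
    """Decide which test to continue with after the first block of items.
    Tests are numbered 1 (most difficult) to n_tests (easiest)."""
    if n_correct <= chance:
        return min(test + 1, n_tests)  # move to the next easier test
    if n_correct >= near_perfect:
        return max(test - 1, 1)        # move to the next more difficult test
    return test


def run_with_recovery(start_test, answer_item, check_after=10,
                      test_length=30, max_reroutes=3):
    """Administer a measurement test with re-routing.
    answer_item(test, item_index) returns True when the item is answered correctly.
    Returns (final test, number correct on the final test, total items given)."""
    test, items_given, reroutes = start_test, 0, 0
    while True:
        correct = sum(bool(answer_item(test, i)) for i in range(check_after))
        items_given += check_after
        new_test = reroute(test, correct)
        if new_test == test or reroutes == max_reroutes:
            break  # stay in the current test once the cap is reached
        test, reroutes = new_test, reroutes + 1
    # Finish the remaining items of the test the examinee ended up in.
    correct += sum(bool(answer_item(test, i)) for i in range(check_after, test_length))
    items_given += test_length - check_after
    return test, correct, items_given
```

The cap on re-routings is a design choice made here to guarantee termination when an examinee oscillates between adjacent tests. As the text points out, the total number of items administered then varies from one examinee to another.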

Much empirical research remains to be done on two-stage 
testing procedures; if the information gained from previous 
empirical studies and the possibilities for improvements 
suggested by these studies can be fully utilized in subse- 
quent research, it is likely that two-stage testing proce- 
dures will become valuable and practical alternatives to 
traditional testing procedures. 






References 

Angoff, W. H. & Huddleston, E. The multi-level experiment:
a study of a two-level test system for the
College Board Scholastic Aptitude Test. Princeton,
New Jersey: Educational Testing Service, Statistical
Report SR-58-21, 1958.

Armitage, P. Sequential analysis with more than two
alternative hypotheses, and its relation to discriminant
function analysis. Journal of the Royal Statistical
Society, 1950, 12, 137-144.

Bayroff, A. G. Feasibility of a programmed testing machine.
U. S. Army Personnel Research Office, Research Study
64-3, November, 1964.

Bayroff, A. G. & Seeley, L. C. An exploratory study of
branching tests. U. S. Army Behavioral Science
Research Laboratory, Technical Research Note 188,
June, 1967.

Bayroff, A. G., Thomas, J. J., & Anderson, A. A.
Construction of an experimental sequential item test.
Research Memorandum 60-1, Personnel Research Branch,
Department of the Army, January, 1960.

Cleary, T. A., Linn, R. L., & Rock, D. A. An exploratory
study of programmed tests. Educational and
Psychological Measurement, 1968, 28, 345-360. (a)

Cleary, T. A., Linn, R. L., & Rock, D. A. Reproduction
of total test score through the use of sequential
programmed tests. Journal of Educational Measurement,
1968, 5, 183-187. (b)

Cronbach, L. J. & Gleser, G. C. Psychological tests and
personnel decisions. (2nd ed.) Urbana: University
of Illinois Press, 1965.

Cronbach, L. J. & Warrington, W. G. Efficiency of
multiple-choice tests as a function of spread of item
difficulties. Psychometrika, 1952, 17, 127-147.

Evans, R. N. A suggested use of sequential analysis in
performance acceptance testing. Urbana: College of
Education, University of Illinois, mimeo, 1953.

Guilford, J. P. Psychometric methods. New York:
McGraw-Hill, 1954.






Hoyt, C. J. Test reliability estimated by analysis of
variance. Psychometrika, 1941, 6, 153-160.

Krathwohl, D. R. & Huyser, J. The sequential item
test (SIT). American Psychologist, 1956, 11, 449.

Linn, R. L., Rock, D. A., & Cleary, T. A. The development
and evaluation of several programmed testing methods.
Educational and Psychological Measurement, 1969, 29,
129-146.

Lord, F. M. The relation of the reliability of
multiple-choice tests to the distribution of item difficulties.
Psychometrika, 1952, 17, 181-194.

Lord, F. M. Some test theory for tailored testing. In
W. H. Holtzman (Ed.), Computer-assisted instruction,
testing, and guidance. New York: Harper and Row, 1970.

Lord, F. M. Robbins-Monro procedures for tailored testing.
Educational and Psychological Measurement, 1971,
31, 3-31. (a)

Lord, F. M. The self-scoring flexilevel test. Journal of
Educational Measurement, 1971, 8, 147-151. (b)

Lord, F. M. A theoretical study of the measurement
effectiveness of flexilevel tests. Educational and
Psychological Measurement, 1971, 31, 805-813. (c)

Lord, F. M. A theoretical study of two-stage testing.
Psychometrika, 1971, 36, 227-241. (d)

Lord, F. M. & Novick, M. R. Statistical theories of mental
test scores. Reading, Mass.: Addison-Wesley, 1968.

McNemar, Q. Psychological statistics. (4th ed.) New York:
Wiley, 1969.

Owen, R. J. A Bayesian approach to tailored testing.
Princeton, N. J.: Educational Testing Service,
Research Bulletin RB-69-92, 1969.

Paterson, J. J. An evaluation of the sequential method
of psychological testing. Unpublished doctoral
dissertation, Michigan State University, 1962.

Seeley, L. C., Morton, M. A., & Anderson, A. A.
Exploratory study of a sequential item test. U. S. Army
Personnel Research Office, Technical Research Note
129, 1962.






Statistical Research Group, Columbia University.
Sequential analysis of statistical data: applications. New
York: Columbia University Press, 1945.

Urry, V. W. A monte carlo investigation of logistic test
models. Unpublished doctoral dissertation, Purdue
University, 1970.

Wald, A. Sequential analysis. New York: Wiley, 1947.

Waters, C. J. Preliminary evaluation of simulated branching
tests. U. S. Army Personnel Research Office,
Technical Research Note 140, 1964.

Waters, C. J. Comparison of computer-simulated conventional
and branching tests. U. S. Army Behavior and Systems
Research Laboratory, Technical Research Note 216, 1970.

Weiss, D. J. Individualized assessment of differential
abilities. Paper presented at the 77th Annual
Convention of the American Psychological Association,
Division 5, September, 1969.

Weiss, D. J. Strategies of computerized ability testing.
Research Report 73-x, Psychometric Methods Program,
Department of Psychology, University of Minnesota,
Minneapolis. (in preparation)

Weiss, D. J. & Betz, N. E. Ability measurement:
conventional or adaptive? Research Report 73-1, Psychometric
Methods Program, Department of Psychology, University
of Minnesota, February, 1973.

Wood, R. Computerized adaptive sequential testing.
Unpublished doctoral dissertation, University of
Chicago, 1971.

Wood, R. Response-contingent testing. Review of Educational
Research, 1973 (in press).




Appendix A
Item Specifications for Two-stage Test

Table A-1

Item difficulty and discrimination indices
for the Routing Test

Item No.    Difficulty (p)    Discrimination (r_b)

    1           .568              .708
    2           .566              .653
    3           .589              .563
    4           .635              .608
    5           .626              .552
    6           .622              .552
    7           .675              .566
    8           .674              .554
    9           .677              .547
   10           .598              .430




Table A-2

Item difficulty and discrimination indices
for Measurement Test 1

Item No.    Difficulty (p)    Discrimination

    1           .094              .390
    2           .169              .497
    3           .136              .475
    4           .108              .384
    5           .096              .353
    6           .153              .384
    7           .098              .343
    8           .250              .670
    9           .267              .538
   10           .277              .508
   11           .293              .491
   12           .295              .460
   13           .276              .458
   14           .265              .456
   15           .210              .451
   16           .264              .438
   17           .222              .407
   18           .205              .398
   19           .204              .388
   20           .226              .332
   21           .220              .326
   22           .242              .321
   23           .317              .323
   24           .318              .348
   25           .335              .440
   26           .337              .339
   27           .345              .612
   28           .346              .327
   29           .349              .386
   30           .353              .375



Table A-3

Item difficulty and discrimination indices
for Measurement Test 2

Item No.    Difficulty (p)    Discrimination

    1        [illegible]          .700
    2           .389              .433
    3        [illegible]          .403
    4           .374              .409
    5           .365              .353
    6           .386              .349
    7           .397              .349
    8           .361              .306
    9           .398              .396
   10           .471              .385
   11           .488              .348
   12           .445              .333
   13           .458              .730
   14        [illegible]          .695
   15           .458              .637
   16           .482              .603
   17           .458              .612
   18           .458              .611
   19        [illegible]          .553
   20           .557              .398
   21           .537           [illegible]
   22           .507              .396
   23           .512              .379
   24           .585              .369
   25           .538              .371
   26           .534              .373
   27           .553              .354
   28           .550              .341
   29           .506              .331
   30           .542              .307



Table A-4

Item difficulty and discrimination indices
for Measurement Test 3

Item No.    Difficulty (p)    Discrimination

    1        [illegible]       [illegible]
    2        [illegible]          .403
    3           .677           [illegible]
    4        [illegible]          .500
    5           .681              .464
    6        [illegible]          .474
    7           .667              .320
    8        [illegible]          .302
    9        [illegible]          .610
   10        [illegible]          .557
   11        [illegible]       [illegible]
   12        [illegible]          .581
   13        [illegible]          .504
   14        [illegible]       [illegible]
   15        [illegible]       [illegible]
   16           .733           [illegible]
   17           .728           [illegible]
   18        [illegible]       [illegible]
   19        [illegible]       [illegible]
   20        [illegible]       [illegible]
   21           .708           [illegible]
   22        [illegible]          .485
   23           .759              .441
   24           .754              .438
   25           .766              .424
   26           .746              .410
   27           .791              .373
   28           .757              .386
   29           .759              .385
   30           .788              .377






Table A-5

Item difficulty and discrimination indices
for Measurement Test 4

Item No.    Difficulty (p)    Discrimination (r_b)

    1           .827              .579
    2           .813           [illegible]
    3           .811           [illegible]
    4        [illegible]       [illegible]
    5           .800              .508
    6        [illegible]       [illegible]
    7           .807           [illegible]
    8        [illegible]       [illegible]
    9        [illegible]       [illegible]
   10           .813              .402
   11        [illegible]          .342
   12        [illegible]          .307
   13           .885              .376
   14           .866              .376
   15           .890              .367
   16        [illegible]          .506
   17        [illegible]          .537
   18           .921              .565
   19           .920           [illegible]
   20           .928              .300
   21        [illegible]       [illegible]
   22        [illegible]       [illegible]
   23        [illegible]       [illegible]
   24        [illegible]       [illegible]
   25        [illegible]          .751
   26        [illegible]       [illegible]
   27           .937              .693
   28        [illegible]       [illegible]
   29        [illegible]       [illegible]
   30        [illegible]          .710






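Appendix B
Item Specifications for Linear Test

Table B-1

Item difficulty and discrimination indices
for the linear test

Item No.    Difficulty (p)    Discrimination

    1           .661           [illegible]
    2           .656           [illegible]
    3           .659           [illegible]
    4           .660           [illegible]
    5        [illegible]          .520
    6           .646              .477
    7           .651           [illegible]
    8           .640           [illegible]
    9        [illegible]       [illegible]
   10        [illegible]       [illegible]
   11           .623           [illegible]
   12        [illegible]          .518
   13           .608           [illegible]
   14        [illegible]       [illegible]
   15           .607           [illegible]
   16        [illegible]       [illegible]
   17        [illegible]       [illegible]
   18        [illegible]       [illegible]
   19        [illegible]       [illegible]
   20        [illegible]       [illegible]
   21        [illegible]       [illegible]
   22        [illegible]       [illegible]
   23        [illegible]       [illegible]
   24        [illegible]       [illegible]
   25        [illegible]       [illegible]
   26        [illegible]       [illegible]
   27        [illegible]       [illegible]
   28        [illegible]       [illegible]
   29           .530              .500
   30        [illegible]       [illegible]
   31           .500              .519
   32           .506           [illegible]
   33        [illegible]          .520
   34           .470           [illegible]
   35           .463              .537
   36        [illegible]       [illegible]
   37        [illegible]          .431
   38           .420              .407
   39        [illegible]          .482
   40           .406              .489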



DISTRIBUTION LIST 



Navy 

4 Dr. Marshall J. Farr, Director
Personnel & Training Research Programs
Office of Naval Research 
Arlington, VA 22217 

1 Director 

ONR Branch Office 
493 Summer Street
Boston, MA 02210
ATTN: G. M. Harsh

1 Director 

ONR Branch Office
1030 East Green Street
Pasadena, CA 91101
ATTN: E. E. Gloye

1 Director 

ONR Branch Office 
536 South Clark Street 
Chicago, IL 60605 
ATTN: M. A. Bertin 

1 Office of Naval Research 
Area Office 
207 West 24th Street
New York, NY 10011

6 Director 

Naval Research Laboratory 
Code 2627 

Washington, DC 20390 

12 Defense Documentation Center
Cameron Station, Building 5
5010 Duke Street
Alexandria, VA 22314

1 Chairman 

Behavioral Science Department 
Naval Command and Management Division 
U.S. Naval Academy
Luce Hall
Annapolis, MD 21402






1 Chief of Naval Technical Training 
Naval Air Station Memphis (75) 
Millington, TN 38054 
ATTN: Dr. G. D. Mayo 

1 Chief of Naval Training 
Naval Air Station 
Pensacola, FL 32508 
ATTN: CAPT Bruce Stone, USN 

1 LCDR Charles J. Theisen, Jr., MSC, USN
4024 

Naval Air Development Center 
Warminster, PA 18974 

1 Commander 

Naval Air Reserve 
Naval Air Station 
Glenview, IL 60026 

1 Commander 

Naval Air Systems Command 
Department of the Navy 
AIR.413C 

Washington, DC 20360 

1 Mr. Lee Miller (AIR a3E) 
Naval Air Systems Command 
5600 Columbia Pike 
Falls Church, VA 22042 

1 Dr. Harold Booher 
NAVAIR 415C 

Naval Air Systems Command 
5600 Columbia Pike 
Falls Church, VA 22042 

1 CAPT John F. Riley, USN 
Commanding Officer 
U.S. Naval Amphibious School 
Coronado, CA 92153 

1 Special Assistant for Manpower 
OASN (M&RA)

The Pentagon, Room 4E794 
Washington, DC 20350 



1 Dr. Richard J. Niehaus 

Office of Civilian Manpower Management 
Code 06A 

Department of the Navy 
Washington, DC 20390

1 CDR Richard L. Martin, USN 
COMFAIRMIRAMAR 
NAS Miramar, CA 92145 

1 Research Director, Code 06 

Research and Evaluation Department 
U.S. Naval Examining Center 
Great Lakes, IL 60088 
ATTN: C. S. Winiewicz 

1 Chief 

Bureau of Medicine and Surgery 
Code iill ' 

Washington, DC 20372

1 Program Coordinator 

Bureau of Medicine and Surgery (Code 110) 
Department of the Navy 
Washington, DC 20372

1 Commanding Officer 

Naval Medical Neuropsychiatric 

Research Unit 
San Diego, CA 92152 

1 Technical Reference Library
Naval Medical Research Institute
National Naval Medical Center 
Bethesda, MD 20014 

1 Chief 

Bureau of Medicine and Surgery
Research Division (Code 713) 
Department of the Navy 
Washington, DC 2037 

1 Dr. John J. Collins
Chief of Naval Operations (OP-987F)
Department of the Navy 
Washington, DC 20350 

1 Technical Library (Pers-11B)
Bureau of Naval Personnel 
Department of the Navy 
Washington, DC 20360 



1 Head, Personnel Measurement Staff
Capital Area Personnel Office 
Ballston Tower #2, Room 1204 
801 N. Randolph Street 
Arlington, VA 22203 

1 Dr. James J. Regan, Technical Director 
Navy Personnel Research
and Development Center
San Diego, CA 92152 

1 Mr. E. P. Somer 
Navy Personnel Research
and Development Center 
San Diego, CA 92152 

1 Dr. Norman Abrahams
Navy Personnel Research 
and Development Center 
San Diego, CA 92152 

1 Dr. Bernard Rimland 
Navy Personnel Research 
and Development Center 
San Diego, CA 92152

1 Commanding Officer
Navy Personnel Research
and Development Center
San Diego, CA 92152

1 Superintendent 

Naval Postgraduate School 
Monterey, CA 92940 
ATTN: Library (Code 2124) 

1 Mr. George N. Graine 

Naval Ship Systems Command 
(SHIPS 03H) 

Department of the Navy 
Washington, DC 20360 

1 Technical Library 

Naval Ship Systems Command 
National Center, Building 3 
Room 3S08 

Washington, DC 20360 

1 Commanding Officer 
Service School Command 
U.S. Naval Training Center 
San Diego, CA 92133 
ATTN: Code 303 



1 Chief of Naval Training Support 
Code N-21
Building 45
Naval Air Station 
Pensacola, FL 32508 

1 Dr. William L. Maloy 

Principal Civilian Advisor 

for Education and Training 
Naval Training Command, Code 01A
Pensacola, FL 32508 

1 CDR Fled Richardson 
Navy Recruiting Command 
BCT #5, Room 215 
Washington, DC 20370 

1 Mr. Arnold Rubinstein 

Naval Material Command (NMAT-0342Z)
Room 820, Crystal Plaza #6 
Washington, DC 20360 

1 Dr. H. Wallace Sinaiko
c/o Office of Naval Research (Code 450)
Psychological Sciences Division
Arlington, VA 22217 

1 Dr. Martin F. Wiskoff 
Navy Personnel Research 
and Development Center 
San Diego, CA 92152 



Army 

1 Commandant 

U.S. Army Institute of Administration 

ATTN: 

Fort Benjamin Harrison, IN 46216

1 Armed Forces Staff College 
Norfolk, VA 23511 
ATTN: Library 

1 Director of Research 

U.S. Army Armor Human Research Unit

ATTN: Library 

Building 24-22 Morade Street 

Fort Knox, KY 40121



1 U.S. Army Research Institute for the 
Behavioral and Social Sciences 

1300 Wilson Boulevard 
Arlington, VA 22209 

1 Commanding Officer 
ATTN: LTC Montgomery 
USACDC - PASA 

Ft. Benjamin Harrison, IN 46249

1 Commandant 

United States Army Infantry School 

ATTN: ATSIN-H 

Fort Benning, GA 31905 

1 U.S. Army Research Institute 
Commonwealth Building, Room 239 
1300 Wilson Boulevard 
Arlington, VA 22209 
ATTN: Dr. R. Dusek 

1 Mr. Edmund F. Fuchs 

U.S. Army Research Institute 
1300 Wilson Boulevard 
Arlington, VA 22209 

1 Commander 

U.S. Theater Army Support Command, 
Europe 

ATTN: Asst. DCSPER (Education)
APO New York 09058 

1 Dr. Stanley L. Cohen 
Work Unit Area Leader 
Organizational Development Work Unit 
Army Research Institute for Behavioral 

and Social Science 
1300 Wilson Boulevard 
Arlington, VA 22209 



Air Force 



1 Headquarters, U.S. Air Force 

Chief, Personnel Research and Analysis 

Division (AF/DPSY) 
Washington, DC 20330 

1 Research and Analysis Division 
AF/DPXYR, Room 4C200
Washington, DC 20330 






1 AFHRL/AS (Dr. G. A. Eckstrand)
Wright-Patterson AFB
Ohio 45433

1 AFHRL/MD
701 Prince Street
Room 200
Alexandria, VA 22314

1 Dr. Robert A. Bottenberg
AFHRL/PES

Lackland AFB, TX 78236 

1 Personnel Research Division 
AFHRL 

Lackland Air Force Base 
Texas 78236 

1 AFOSR (NL)
1400 Wilson Boulevard
Arlington, VA 22209

1 Commandant 

USAF School of Aerospace Medicine 
Aeromedical Library (SUL-4)
Brooks AFB, TX 78235 

1 CAPT Jack Thorpe, USAF 

Department of Psychology 
Bowling Green State University 
Bowling Green, OH 43403

Marine Corps 

1 Commandant, Marine Corps 
Code A01M-2

Washington, DC 20380 

1 COL George Caridakis 

Director, Office of Manpower Utilization 

Headquarters, Marine Corps (A01H)
MCB
Quantico, VA 22134

1 Dr. A. L. Slafkosky 

Scientific Advisor (Code Ax) 
Commandant of the Marine Corps 
Washington, DC 20380 



1 Mr. E. A. Dover
Manpower Measurement Unit (Code A01M-2)
Arlington Annex, Room 2413
Arlington, VA 20370 

Coast Guard 

1 Mr. Joseph J. Cowan, Chief 

Psychological Research Branch (P-1)
U.S. Coast Guard Headquarters
400 Seventh Street, SW
Washington, DC 20590 

Other DOD

1 Lt. Col. Austin W. Kibler, Director
Human Resources Research Office
Advanced Research Projects Agency
1400 Wilson Boulevard
Arlington, VA 22209 

1 Mr. Helga Yeich, Director 

Program Management, Defense Advanced 

Research Projects Agency 
1400 Wilson Boulevard
Arlington, VA 22209 

1 Dr. Ralph R. Canter 

Director for Manpower Research 
Office of Secretary of Defense 
The Pentagon, Room 3C980 
Washington, DC 20301 

Other Government 

1 Dr. Lorraine D. Eyde
Personnel Research and Development Center
U.S. Civil Service Commission, Room 3458
1900 E. Street, N.W. 
Washington, DC 20415 

1 Dr. Vern Urry

Personnel Research and Development 
Center 

U.S. Civil Service Commission
Washington, DC 20415 






Miscellaneous

1 Dr. Scarvia Anderson 

Executive Director for Special
Development
Educational Testing Service
Princeton, NJ 08540

1 Dr. Richard C. Atkinson 
Stanford University 
Department of Psychology
Stanford, CA 94305

1 Dr. Bernard M. Bass 
University of Rochester 
Management Research Center 
Rochester, NY 14627

1 Mr. H. Dean Brown 

Stanford Research Institute 
333 Ravenswood Avenue 
Menlo Park, CA 94025 

1 Mr. Michael W. Brown 
Operations Research, Inc. 
1400 Spring Street
Silver Spring, MD 20910 

1 Dr. Ronald P. Carver 

American Institutes for Research 
8555 Sixteenth Street 
Silver Spring, MD 20910 

1 Century Research Corporation 
4113 Lee Highway 
Arlington, VA 22207

1 Dr. Kenneth E. Clark 
University of Rochester 
College of Arts and Sciences 
River Campus Station 
Rochester, NY 14627

1 Dr. René V. Dawis
University of Minnesota 
Department of Psychology 
Minneapolis, MN 55455 



1 Dr. Norman R. Dixon 

Associate Professor of Higher 

Education 
University of Pittsburgh 
617 Cathedral of Learning 
Pittsburgh, PA 15213 

1 Dr. Robert Dubin 

University of California
Graduate School of Administration
Irvine, CA 92664

1 Dr. Marvin D. Dunnette 
University of Minnesota 
Department of Psychology 
N492 Elliott Hall
Minneapolis, MN 55455

1 Dr. Victor Fields

Department of Psychology 
Montgomery College 
Rockville, MD 20850 

1 Dr. Edwin A. Fleishman 

American Institutes for Research 
8555 Sixteenth Street 
Silver Spring, MD 20910 

1 Dr. Robert Glaser, Director
University of Pittsburgh 
Learning Research and Development Center 
Pittsburgh, PA 15213 

1 Dr. Albert S. Glickman

American Institutes for Research 
8555 Sixteenth Street 
Silver Spring, MD 20910 

1 Dr. Duncan N. Hansen 
Florida State University 
Center for Computer-Assisted Instruction 
Tallahassee, FL 32306

1 Dr. Harry H. Harman 

Educational Testing Service 
Division of Analytical Studies
and Services
Princeton, NJ 08540

1 Dr. Richard S. Hatch 

Decision Systems Associates, Inc. 
11428 Rockville Pike 
Rockville, MD 20852 



1 Dr. M. D. Havron
Human Sciences Research, Inc.
Westgate Industrial Park
7710 Old Springhouse Road 
McLean, VA 22101 

1 Human Resources Research Organization
Division #3
P.O. Box 5787
Presidio of Monterey, CA 93940

1 Human Resources Research Organization
Division #4, Infantry
P.O. Box 2086
Fort Benning, GA 31905

1 Human Resources Research Organization
Division #5, Air Defense
P.O. Box 6057
Fort Bliss, TX 79916

1 Human Resources Research Organization
Division #6, Library
P.O. Box 428
Fort Rucker, AL 36360

1 Dr. Lawrence B. Johnson 

Lawrence Johnson and Associates, Inc. 
200 S Street, N.W., Suite 502 
Washington, DC 20009 

1 Dr. Norman J. Johnson 
Carnegie-Mellon University 
School of Urban and Public Affairs 
Pittsburgh, PA 15213 

1 Dr. Frederick M. Lord 

Educational Testing Service 
Princeton, NJ 08540 

1 Dr. E. J. McCormick 
Purdue University

Department of Psychological Sciences 
Lafayette, IN 47907 

1 Dr. Robert R. Mackie

Human Factors Research, Inc. 
6780 Cortona Drive 
Santa Barbara Research Park 
Goleta, CA 93017 



1 Dr. Edmond Marks
109 Grange Building 
Pennsylvania State University 
University Park, PA 16802 

1 Dr. Leo Munday 
Vice President 

American College Testing Program 

P.O. Box 168 

Iowa City, IA 52240

1 Mr. Luigi Petrullo
2431 North Edgewood Street
Arlington, VA 22207 

1 Dr. Robert D. Pritchard

Assistant Professor of Psychology 
Purdue University 
Lafayette, IN 47907 

1 Dr. Diane M. Ramsey-Klee
R-K Research & System Design 
3947 Ridgemont Drive 
Malibu, CA 90265 

1 Dr. Joseph W. Rigney 

Behavioral Technology Laboratories 
University of Southern California 
3717 South Grand 
Los Angeles, CA 90007 

1 Dr. George E. Rowland 
Rowland and Company, Inc. 
P.O. Box 61 

Haddonfield, NJ 08033 

1 Dr. Benjamin Schneider 
University of Maryland 
Department of Psychology 
College Park, MD 20742 

1 Dr. Arthur I. Siegel 

Applied Psychological Services 
Science Center 
404 East Lancaster Avenue
Wayne, PA 19087 



1 Mr. Dennis J. Sullivan 
725 Benson Way 
Thousand Oaks, CA 91360 

1 Dr. Anita West 

Denver Research Institute 
University of Denver 
Denver, CO 80210 

1 Dr. John Annett
The Open University
Milton Keynes
Buckinghamshire
ENGLAND 

1 Dr. Charles A. Ullmann 

Director, Behavioral Sciences Studies 
Information Concepts Incorporated 
1701 No. Ft. Myer Drive 
Arlington, VA 22209 



1 Dr. H. Peter Dachler 
University of Maryland 
Department of Psychology
College Park, MD 20742 






